The objective of this problem set is to put together everything you have learned this semester. For this reason, the problem set has some parts that are a bit open ended. In these parts, you will need to use your judgment as to what is the best way to do something (make sure you make explicit your reasoning!). In other parts, the questions will be precise (and hopefully your answer will be so as well).
In this problem set, you will work with data collected from Airbnb by a third party. The objective of the problem set, ultimately, is to learn about Airbnb listings, whether Airbnb hosts are operate as single- or multi-home managers, identify locations in which Airbnb listings are concentrated, which listings compete with each other, etc.
This is a relatively long problem set. For this reason, please do not wait until the last moment to work on it. Doing so will ensure that you won't be able to finish. Because the semester is ending, I will not be able to provide extensions unless these follow University rules.
Organize your workflow. Create a directory dedicated to this problem set. Inside this directory, create directories called RawData, ProcessedData, Code, and Figures. You will store the data you get from the Internet in the RawData directory. Intermediate and final datasets created by you will go to ProcessedData. The code you write will be saved to Code and the figures you produced will be saved to Figures. However, you will turn in your Jupyter Notebook so make sure that all the output is both displayed on the notebook and saved to the relevant directories.
Go to http://insideairbnb.com/:
Data and to Get the data. You will see that there are many datasets from many different cities.listings.csv.gz for the city that you chose. Download this file to your computer and save it in RawData. Extract to the same directory the file that is inside the compressed file.problem_set_5_lastname_firstname.ipynb and save it to Code.I'm going to be vague here but there are several things you should do. Doing these, however, is the bare minimum and will ensure you get 60 percent of the points of this section. You should think carefully about other statistics that may be of interest.
id, name,host_id,host_since, host_location,host_acceptance_rate,host_is_superhost, host_neighbourhood, neighbourhood_cleansed,latitude,longitude, property_type, room_type, accommodates, bathrooms,bedrooms, beds, price,minimum_nights, maximum_nights, number_of_reviews, review_scores_rating, review_scores_accuracy, review_scores_communication, review_scores_location, review_scores_value. Therefore, make sure you drop all other columns. If these covariates do not exist for the city you chose, choose a different city! If you find a covariate that I didn't add to the list and that you consider relevant for the analysis, add it and make sure you tell me about it and why you want to include it.review_scores_accuracy, review_scores_communication, review_scores_location, review_scores_value, minimum_nights, maximum_nights.# In this cell, import all the relevant modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from shapely.geometry import Polygon, Point, LineString
import folium
import requests
import geopandas as gpd
#desktop path
rel_path = r'C:\\Users\\Hunter\\Documents\\econ471\\problemset5\\'
in_path = r'C:\\Users\\Hunter\\Documents\\econ471\\problemset5\\rawdata'
in_path = rel_path + r'\\rawdata\\'
out_path = rel_path + r'\\processeddata\\'
out_pathfig = rel_path + r'\\figures\\'
# In this cell, define the relevant paths
rel_path = r'C:\\Users\\Keesu\\Documents\\econ471\\problemset5\\'
in_path = r'C:\\Users\Hunter\Documents\econ471\problemset5\rawdata'
in_path = rel_path + r'\\rawdata\\'
out_path = rel_path + r'\\processeddata\\'
out_pathfig = rel_path + r'\\figures\\'
# In this cell, define the name of the file you will import
fname = 'listings.csv'
df = pd.read_csv(in_path + fname, keep_default_na=True)
df = pd.DataFrame(df)
# In this cell, read in the data
df.head(2290)
| id | listing_url | scrape_id | last_scraped | source | name | description | neighborhood_overview | picture_url | host_id | ... | review_scores_communication | review_scores_location | review_scores_value | license | instant_bookable | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2384 | https://www.airbnb.com/rooms/2384 | 20230912032549 | 2023-09-12 | city scrape | Condo in Chicago · ★4.99 · 1 bedroom · 1 bed ·... | You are invited stay in the guest room of my v... | The apartment is less than one block from beau... | https://a0.muscache.com/pictures/acf6b3c0-47f2... | 2613 | ... | 4.99 | 4.96 | 4.94 | R17000015609 | f | 1 | 0 | 1 | 0 | 2.15 |
| 1 | 7126 | https://www.airbnb.com/rooms/7126 | 20230912032549 | 2023-09-12 | city scrape | Rental unit in Chicago · ★4.70 · 1 bedroom · 1... | A very small studio in a wonderful neighborhoo... | Ukrainian Village was just named "Hottest Neig... | https://a0.muscache.com/pictures/51073/16c81c7... | 17928 | ... | 4.87 | 4.89 | 4.75 | R21000075737 | f | 1 | 1 | 0 | 0 | 2.92 |
| 2 | 10945 | https://www.airbnb.com/rooms/10945 | 20230912032549 | 2023-09-12 | city scrape | Rental unit in Chicago · ★4.65 · 2 bedrooms · ... | Beautiful first floor apartment in Historic Ol... | NaN | https://a0.muscache.com/pictures/58d1a420-a24b... | 33004 | ... | 4.82 | 4.99 | 4.65 | 2209984 | t | 7 | 7 | 0 | 0 | 0.63 |
| 3 | 12140 | https://www.airbnb.com/rooms/12140 | 20230912032549 | 2023-09-12 | city scrape | Boutique hotel in Chicago · ★4.93 · 1 bedroom ... | Bed and Breakfast license issued by the City o... | The Guest House is in the heart of Chicago's m... | https://a0.muscache.com/pictures/miso/Hosting-... | 46734 | ... | 5.00 | 5.00 | 4.87 | R20000055258 | f | 1 | 0 | 1 | 0 | 0.15 |
| 4 | 24833 | https://www.airbnb.com/rooms/24833 | 20230912032549 | 2023-09-12 | city scrape | Rental unit in Chicago · ★4.29 · 1 bedroom · 1... | Perfect private entrance apartment one block f... | Lincoln Park is a great neighborhood where peo... | https://a0.muscache.com/pictures/ac17be6f-40bd... | 101521 | ... | 4.03 | 4.90 | 4.20 | City registration pending | f | 4 | 4 | 0 | 0 | 0.26 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2285 | 37091430 | https://www.airbnb.com/rooms/37091430 | 20230912032549 | 2023-09-12 | city scrape | Condo in Chicago · ★4.83 · 2 bedrooms · 3 beds... | Brand New Private Modern 2 bd 2bth apt for you... | This neighborhood is alive! Pilsen has a vinta... | https://a0.muscache.com/pictures/73e3f383-b4d2... | 88547320 | ... | 4.95 | 4.72 | 4.83 | R20000052495 | f | 10 | 10 | 0 | 0 | 5.51 |
| 2286 | 37107960 | https://www.airbnb.com/rooms/37107960 | 20230912032549 | 2023-09-12 | city scrape | Rental unit in Chicago · 1 bedroom · 1 bed · 1... | Show up and start living from day one in Chica... | This furnished rental is in the historic Gold ... | https://a0.muscache.com/pictures/prohost-api/H... | 107434423 | ... | NaN | NaN | NaN | 32+ Days Listing | f | 645 | 645 | 0 | 0 | NaN |
| 2287 | 37113529 | https://www.airbnb.com/rooms/37113529 | 20230912032549 | 2023-09-12 | city scrape | Condo in Chicago · ★4.80 · 3 bedrooms · 5 beds... | My Chicago themed home is a perfect retreat on... | For a feel of the youthful, creative energy on... | https://a0.muscache.com/pictures/prohost-api/H... | 11981271 | ... | 4.95 | 4.91 | 4.84 | R22000081935 | t | 3 | 3 | 0 | 0 | 1.70 |
| 2288 | 37125424 | https://www.airbnb.com/rooms/37125424 | 20230912032549 | 2023-09-12 | city scrape | Condo in Chicago · ★4.95 · 1 bedroom · 1 bed ·... | Our home is a lovely city terrace modern loft ... | The area hosts the Finest Restaurants, and co... | https://a0.muscache.com/pictures/pro_photo_too... | 8540884 | ... | 5.00 | 5.00 | 4.66 | R19000046100 | f | 1 | 1 | 0 | 0 | 0.85 |
| 2289 | 37143206 | https://www.airbnb.com/rooms/37143206 | 20230912032549 | 2023-09-12 | city scrape | Home in Chicago · ★5.0 · 1 bedroom · 1 bed · 1... | We are located by the I-290 and a 10 minute wa... | Ronak’s place is located in Chicago, Illinois,... | https://a0.muscache.com/pictures/e5c0bd38-ce61... | 110725131 | ... | 5.00 | 3.25 | 5.00 | R19000042272 | f | 5 | 0 | 5 | 0 | 0.08 |
2290 rows × 75 columns
We will restrict attention to only some variables. We'll drop variables that have too many missing values.
# In this cell restrict the dataframe to the columns identified above.
df = df[['id', 'name','host_id','host_since', 'host_location','host_acceptance_rate','host_is_superhost', 'host_neighbourhood', 'neighbourhood_cleansed','latitude','longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms','bedrooms', 'beds', 'price','minimum_nights', 'maximum_nights', 'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_communication', 'review_scores_location', 'review_scores_value']]
df= pd.DataFrame(df)
df.head(2890)
# If you decide to keep a variable not mentioned above, remember to tell me why you are doing it. I want to know your thought process!
| id | name | host_id | host_since | host_location | host_acceptance_rate | host_is_superhost | host_neighbourhood | neighbourhood_cleansed | latitude | ... | beds | price | minimum_nights | maximum_nights | number_of_reviews | review_scores_rating | review_scores_accuracy | review_scores_communication | review_scores_location | review_scores_value | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2384 | Condo in Chicago · ★4.99 · 1 bedroom · 1 bed ·... | 2613 | 2008-08-29 | Chicago, IL | 92% | t | Hyde Park | Hyde Park | 41.787900 | ... | 1.0 | $114.00 | 3 | 60 | 227 | 4.99 | 4.98 | 4.99 | 4.96 | 4.94 |
| 1 | 7126 | Rental unit in Chicago · ★4.70 · 1 bedroom · 1... | 17928 | 2009-05-19 | Chicago, IL | 95% | t | Ukrainian Village | West Town | 41.901660 | ... | 1.0 | $92.00 | 32 | 60 | 505 | 4.70 | 4.84 | 4.87 | 4.89 | 4.75 |
| 2 | 10945 | Rental unit in Chicago · ★4.65 · 2 bedrooms · ... | 33004 | 2009-08-21 | Chicago, IL | 95% | t | Old Town | Lincoln Park | 41.911960 | ... | 2.0 | $170.00 | 4 | 1125 | 72 | 4.65 | 4.76 | 4.82 | 4.99 | 4.65 |
| 3 | 12140 | Boutique hotel in Chicago · ★4.93 · 1 bedroom ... | 46734 | 2009-10-18 | Chicago, IL | 100% | f | Lincoln Park | Lincoln Park | 41.923570 | ... | 1.0 | $329.00 | 2 | 10 | 15 | 4.93 | 4.67 | 5.00 | 5.00 | 4.87 |
| 4 | 24833 | Rental unit in Chicago · ★4.29 · 1 bedroom · 1... | 101521 | 2010-03-31 | Chicago, IL | 90% | t | Lincoln Park | Lincoln Park | 41.925961 | ... | 1.0 | $61.00 | 32 | 395 | 42 | 4.29 | 4.07 | 4.03 | 4.90 | 4.20 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2885 | 42900122 | Rental unit in Chicago · ★4.91 · 3 bedrooms · ... | 63239107 | 2016-03-16 | Chicago, IL | 100% | t | NaN | Grand Boulevard | 41.812620 | ... | 4.0 | $172.00 | 1 | 21 | 56 | 4.91 | 4.95 | 5.00 | 4.71 | 4.84 |
| 2886 | 42910760 | Rental unit in Chicago · ★4.25 · Studio · 2 be... | 76244 | 2010-02-01 | Chicago, IL | 28% | f | Lincoln Park | Lincoln Park | 41.919310 | ... | 2.0 | $200.00 | 5 | 1125 | 20 | 4.25 | 4.65 | 4.85 | 4.90 | 4.10 |
| 2887 | 42920816 | Rental unit in Chicago · 1 bedroom · 1 bed · 1... | 107434423 | 2016-12-16 | New York, NY | 97% | f | Cambridge | Near West Side | 41.883056 | ... | 1.0 | $170.00 | 32 | 1125 | 2 | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 |
| 2888 | 42929197 | Home in Chicago · ★4.79 · 2 bedrooms · 2 beds ... | 22998917 | 2014-10-26 | Chicago, IL | 90% | f | Humboldt Park | Humboldt Park | 41.908940 | ... | 2.0 | $177.00 | 3 | 1125 | 62 | 4.79 | 4.89 | 4.92 | 4.73 | 4.77 |
| 2889 | 42933476 | Home in Chicago · ★4.93 · 1 bedroom · 1 bed · ... | 47056445 | 2015-10-20 | Chicago, IL | 85% | t | South Shore | South Shore | 41.762340 | ... | 1.0 | $12.00 | 11 | 1125 | 14 | 4.93 | 4.93 | 5.00 | 4.64 | 4.86 |
2890 rows × 26 columns
How many listings are there? How many hosts? Are all variables in the correct format?
# How many listings are there?
counts = len(df)
print(f"Number of listings in df: {counts}")
Number of listings in df: 8528
# How many hosts?
unique_hosts = df['host_id'].unique()
pd.DataFrame(unique_hosts, columns=['individual_host'])
#We can see that there are a total of 3834 different hosts. This means that there are a lot of hosts that own multiple units.
| individual_host | |
|---|---|
| 0 | 2613 |
| 1 | 17928 |
| 2 | 33004 |
| 3 | 46734 |
| 4 | 101521 |
| ... | ... |
| 3829 | 395141659 |
| 3830 | 415825853 |
| 3831 | 468284714 |
| 3832 | 97961430 |
| 3833 | 536514763 |
3834 rows × 1 columns
df.dtypes
id int64 name object host_id int64 host_since object host_location object host_acceptance_rate object host_is_superhost object host_neighbourhood object neighbourhood_cleansed object latitude float64 longitude float64 property_type object room_type object accommodates int64 bathrooms float64 bedrooms float64 beds float64 price object minimum_nights int64 maximum_nights int64 number_of_reviews int64 review_scores_rating float64 review_scores_accuracy float64 review_scores_communication float64 review_scores_location float64 review_scores_value float64 dtype: object
price_cor_id = 14167586
new_price = '75'
#for some reason this one listing has the wrong price, they had it listed as 7580 instead of just 75, i went to
#the posting in the csv file
df.loc[df['id'] == price_cor_id, 'price'] = new_price
search_id = 14167586
result = df[df['id'] == search_id]
result
| id | name | host_id | host_since | host_location | host_acceptance_rate | host_is_superhost | host_neighbourhood | neighbourhood_cleansed | latitude | ... | beds | price | minimum_nights | maximum_nights | number_of_reviews | review_scores_rating | review_scores_accuracy | review_scores_communication | review_scores_location | review_scores_value | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 673 | 14167586 | Rental unit in Chicago · ★4.80 · 1 bedroom · 1... | 27232871 | 2015-02-04 | Chicago, IL | 100% | f | Ukrainian Village | West Town | 41.89863 | ... | 1.0 | 75 | 32 | 1125 | 213 | 4.8 | 4.85 | 4.86 | 4.92 | 4.86 |
1 rows × 26 columns
# Create a new variable called `df['price2']` that is equal to `price` but of the right type and with the right format.
df['price2'] = df['price']
df['price2'] = df['price2'].str.replace('$', '')
df['price2'] = df['price2'].str.replace(',', '')
df['price2'] = df['price2'].astype('float')
#you have to transfer as a float before an integer
df['price2'] = df['price2'].astype('Int64')
df.head(2290)
C:\Users\Hunter\AppData\Local\Temp\ipykernel_2176\934704436.py:3: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
df['price2'] = df['price2'].str.replace('$', '')
| id | name | host_id | host_since | host_location | host_acceptance_rate | host_is_superhost | host_neighbourhood | neighbourhood_cleansed | latitude | ... | price | minimum_nights | maximum_nights | number_of_reviews | review_scores_rating | review_scores_accuracy | review_scores_communication | review_scores_location | review_scores_value | price2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2384 | Condo in Chicago · ★4.99 · 1 bedroom · 1 bed ·... | 2613 | 2008-08-29 | Chicago, IL | 92% | t | Hyde Park | Hyde Park | 41.787900 | ... | $114.00 | 3 | 60 | 227 | 4.99 | 4.98 | 4.99 | 4.96 | 4.94 | 114 |
| 1 | 7126 | Rental unit in Chicago · ★4.70 · 1 bedroom · 1... | 17928 | 2009-05-19 | Chicago, IL | 95% | t | Ukrainian Village | West Town | 41.901660 | ... | $92.00 | 32 | 60 | 505 | 4.70 | 4.84 | 4.87 | 4.89 | 4.75 | 92 |
| 2 | 10945 | Rental unit in Chicago · ★4.65 · 2 bedrooms · ... | 33004 | 2009-08-21 | Chicago, IL | 95% | t | Old Town | Lincoln Park | 41.911960 | ... | $170.00 | 4 | 1125 | 72 | 4.65 | 4.76 | 4.82 | 4.99 | 4.65 | 170 |
| 3 | 12140 | Boutique hotel in Chicago · ★4.93 · 1 bedroom ... | 46734 | 2009-10-18 | Chicago, IL | 100% | f | Lincoln Park | Lincoln Park | 41.923570 | ... | $329.00 | 2 | 10 | 15 | 4.93 | 4.67 | 5.00 | 5.00 | 4.87 | 329 |
| 4 | 24833 | Rental unit in Chicago · ★4.29 · 1 bedroom · 1... | 101521 | 2010-03-31 | Chicago, IL | 90% | t | Lincoln Park | Lincoln Park | 41.925961 | ... | $61.00 | 32 | 395 | 42 | 4.29 | 4.07 | 4.03 | 4.90 | 4.20 | 61 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2285 | 37091430 | Condo in Chicago · ★4.83 · 2 bedrooms · 3 beds... | 88547320 | 2016-08-07 | NaN | 97% | t | Lower West Side | Lower West Side | 41.856750 | ... | $99.00 | 1 | 1000 | 263 | 4.83 | 4.90 | 4.95 | 4.72 | 4.83 | 99 |
| 2286 | 37107960 | Rental unit in Chicago · 1 bedroom · 1 bed · 1... | 107434423 | 2016-12-16 | New York, NY | 97% | f | Cambridge | Near North Side | 41.898904 | ... | $126.00 | 32 | 1125 | 0 | NaN | NaN | NaN | NaN | NaN | 126 |
| 2287 | 37113529 | Condo in Chicago · ★4.80 · 3 bedrooms · 5 beds... | 11981271 | 2014-02-05 | Chicago, IL | 87% | t | West Town/Noble Square | West Town | 41.891286 | ... | $132.00 | 1 | 365 | 85 | 4.80 | 4.75 | 4.95 | 4.91 | 4.84 | 132 |
| 2288 | 37125424 | Condo in Chicago · ★4.95 · 1 bedroom · 1 bed ·... | 8540884 | 2013-08-30 | Chicago, IL | 100% | t | West Loop/Greektown | Near West Side | 41.886380 | ... | $253.00 | 10 | 365 | 41 | 4.95 | 4.95 | 5.00 | 5.00 | 4.66 | 253 |
| 2289 | 37143206 | Home in Chicago · ★5.0 · 1 bedroom · 1 bed · 1... | 110725131 | 2017-01-09 | Houston, TX | 43% | f | Garfield Park | West Garfield Park | 41.877450 | ... | $42.00 | 30 | 1125 | 4 | 5.00 | 4.75 | 5.00 | 3.25 | 5.00 | 42 |
2290 rows × 27 columns
df['host_acceptance_rate2'] = df['host_acceptance_rate'].str.replace('%', '')
df['host_acceptance_rate2'] = df['host_acceptance_rate2'].astype('Int64')
df.dtypes
id int64 name object host_id int64 host_since object host_location object host_acceptance_rate object host_is_superhost object host_neighbourhood object neighbourhood_cleansed object latitude float64 longitude float64 property_type object room_type object accommodates int64 bathrooms float64 bedrooms float64 beds float64 price object minimum_nights int64 maximum_nights int64 number_of_reviews int64 review_scores_rating float64 review_scores_accuracy float64 review_scores_communication float64 review_scores_location float64 review_scores_value float64 price2 Int64 host_acceptance_rate2 Int64 dtype: object
# Create a new variable called `df['host_acceptance_rate2']` that is equal to `host_acceptance_rate` but of the right type and with the right format.
df['host_acceptance_rate2'] = df['host_acceptance_rate']
df['host_acceptance_rate2'] = df['host_acceptance_rate2'].str.replace('%', '')
df['host_acceptance_rate2'] = df['host_acceptance_rate2'].astype('float')
# Summary statistics of _relevant_ variables. Remember to first check that each relevant variable has the right format/is of the right type. Modify the variables if needed.
#i modified the other variables above so I can work on them here
# Create a new variable called `df['price2']` that is equal to `price` but of the right type and with the right format.
df['price2'] = df['price']
df['price2'] = df['price2'].str.replace('$', '')
df['price2'] = df['price2'].str.replace(',', '')
df['price2'] = df['price2'].astype('float')
#you have to transfer as a float before an integer
df['price2'] = df['price2'].astype('Int64')
#host acceptance rate
df['host_acceptance_rate2'] = df['host_acceptance_rate'].str.replace('%', '')
df['host_acceptance_rate2'] = df['host_acceptance_rate2'].astype('Int64')
# Create a new variable called `df['host_is_superhost2']` that is equal to `host_is_superhost` but of the right type and with the right format.
df['host_is_superhost2'] = df['host_is_superhost'].replace({'f': 0, 't': 1}).astype(bool)
rel_var = df[['price2','host_acceptance_rate2', 'host_is_superhost2','bathrooms','beds', 'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy']]
rel_var.describe()
C:\Users\Hunter\AppData\Local\Temp\ipykernel_2176\1227629468.py:5: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
df['price2'] = df['price2'].str.replace('$', '')
| price2 | host_acceptance_rate2 | bathrooms | beds | number_of_reviews | review_scores_rating | review_scores_accuracy | |
|---|---|---|---|---|---|---|---|
| count | 8528.0 | 7779.0 | 0.0 | 8482.000000 | 8528.000000 | 6864.000000 | 6852.000000 |
| mean | 198.283185 | 91.912585 | NaN | 2.245461 | 46.595333 | 4.739273 | 4.789812 |
| std | 255.263629 | 17.881487 | NaN | 1.705293 | 87.779889 | 0.454732 | 0.388242 |
| min | 12.0 | 0.0 | NaN | 1.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 87.0 | 94.0 | NaN | 1.000000 | 1.000000 | 4.687500 | 4.750000 |
| 50% | 135.0 | 98.0 | NaN | 2.000000 | 14.000000 | 4.860000 | 4.900000 |
| 75% | 214.0 | 100.0 | NaN | 3.000000 | 54.000000 | 4.980000 | 5.000000 |
| max | 5000.0 | 100.0 | NaN | 24.000000 | 3332.000000 | 5.000000 | 5.000000 |
df.dtypes
id int64 name object host_id int64 host_since object host_location object host_acceptance_rate object host_is_superhost object host_neighbourhood object neighbourhood_cleansed object latitude float64 longitude float64 property_type object room_type object accommodates int64 bathrooms float64 bedrooms float64 beds float64 price object minimum_nights int64 maximum_nights int64 number_of_reviews int64 review_scores_rating float64 review_scores_accuracy float64 review_scores_communication float64 review_scores_location float64 review_scores_value float64 price2 Int64 host_acceptance_rate2 Int64 host_is_superhost2 bool dtype: object
There are some columns we may want to transform to numeric format. For example, price, host_acceptance_rate, and host_is_superhost appear to be in a wrong format. Price, for example, is stored as strings (at least is not a mixed type variable!). We need to clean it and make it a numeric variable.
Something similar happens with host_acceptance_rate. Transform this variable into percentages (0.75, for example, instead of 75%).
# Create a new variable called `df['host_acceptance_rate2']` that is equal to `host_acceptance_rate` but of the right type and with the right format.
df['host_acceptance_rate2'] = df['host_acceptance_rate']
df['host_acceptance_rate2'] = df['host_acceptance_rate2'].str.replace('%', '')
df['host_acceptance_rate2'] = df['host_acceptance_rate2'].astype('float')
host_is_superhost seems to be boolean but it is not. It is supposed to be boolean, with f for 0 and t for 1. Transform it to numeric, with 1 instead of t and 0 instead of f.
# Create a new variable called `df['host_is_superhost2']` that is equal to `host_is_superhost` but of the right type and with the right format.
df['host_is_superhost2'] = df['host_is_superhost'].replace({'f': 0, 't': 1}).astype(bool)
missing_values = df['bathrooms'].isna()
#every single row is missing baths, I should just drop it below
rows_with_missing_values = df[missing_values]
rows_with_missing_values
| id | name | host_id | host_since | host_location | host_acceptance_rate | host_is_superhost | host_neighbourhood | neighbourhood_cleansed | latitude | ... | maximum_nights | number_of_reviews | review_scores_rating | review_scores_accuracy | review_scores_communication | review_scores_location | review_scores_value | price2 | host_acceptance_rate2 | host_is_superhost2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2384 | Condo in Chicago · ★4.99 · 1 bedroom · 1 bed ·... | 2613 | 2008-08-29 | Chicago, IL | 92% | t | Hyde Park | Hyde Park | 41.787900 | ... | 60 | 227 | 4.99 | 4.98 | 4.99 | 4.96 | 4.94 | 114 | 92.0 | True |
| 1 | 7126 | Rental unit in Chicago · ★4.70 · 1 bedroom · 1... | 17928 | 2009-05-19 | Chicago, IL | 95% | t | Ukrainian Village | West Town | 41.901660 | ... | 60 | 505 | 4.70 | 4.84 | 4.87 | 4.89 | 4.75 | 92 | 95.0 | True |
| 2 | 10945 | Rental unit in Chicago · ★4.65 · 2 bedrooms · ... | 33004 | 2009-08-21 | Chicago, IL | 95% | t | Old Town | Lincoln Park | 41.911960 | ... | 1125 | 72 | 4.65 | 4.76 | 4.82 | 4.99 | 4.65 | 170 | 95.0 | True |
| 3 | 12140 | Boutique hotel in Chicago · ★4.93 · 1 bedroom ... | 46734 | 2009-10-18 | Chicago, IL | 100% | f | Lincoln Park | Lincoln Park | 41.923570 | ... | 10 | 15 | 4.93 | 4.67 | 5.00 | 5.00 | 4.87 | 329 | 100.0 | False |
| 4 | 24833 | Rental unit in Chicago · ★4.29 · 1 bedroom · 1... | 101521 | 2010-03-31 | Chicago, IL | 90% | t | Lincoln Park | Lincoln Park | 41.925961 | ... | 395 | 42 | 4.29 | 4.07 | 4.03 | 4.90 | 4.20 | 61 | 90.0 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8523 | 976998614768513719 | Rental unit in Chicago · ★New · 5 bedrooms · 8... | 11120895 | 2014-01-08 | Chicago, IL | 98% | t | Tri-Taylor | Near West Side | 41.875333 | ... | 365 | 0 | NaN | NaN | NaN | NaN | NaN | 438 | 98.0 | True |
| 8524 | 977144971555557664 | Home in Chicago · ★New · 1 bedroom · 1 bed · 2... | 260588213 | 2019-05-08 | Chicago, IL | 100% | f | McKinley Park | Mckinley Park | 41.825997 | ... | 365 | 0 | NaN | NaN | NaN | NaN | NaN | 32 | 100.0 | False |
| 8525 | 977746614717655088 | Home in Chicago · ★New · 3 bedrooms · 3 beds ·... | 536514763 | 2023-09-10 | NaN | NaN | f | West Englewood | West Englewood | 41.763133 | ... | 365 | 0 | NaN | NaN | NaN | NaN | NaN | 24 | NaN | False |
| 8526 | 977980712187801208 | Rental unit in Chicago · ★New · 1 bedroom · 1 ... | 107434423 | 2016-12-16 | New York, NY | 97% | f | Cambridge | Loop | 41.869479 | ... | 1125 | 0 | NaN | NaN | NaN | NaN | NaN | 127 | 97.0 | False |
| 8527 | 978264935480956783 | Rental unit in Chicago · ★New · 3 bedrooms · 4... | 518917937 | 2023-06-08 | Chicago, IL | 99% | f | West Town | Lake View | 41.948470 | ... | 365 | 0 | NaN | NaN | NaN | NaN | NaN | 366 | 99.0 | False |
8528 rows × 29 columns
df.drop('bathrooms', axis=1, inplace=True)
rel_var = df[['price2','host_acceptance_rate2', 'host_is_superhost2','beds', 'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy']]
rel_var.describe()
| price2 | host_acceptance_rate2 | beds | number_of_reviews | review_scores_rating | review_scores_accuracy | |
|---|---|---|---|---|---|---|
| count | 8528.0 | 7779.000000 | 8482.000000 | 8528.000000 | 6864.000000 | 6852.000000 |
| mean | 198.283185 | 91.912585 | 2.245461 | 46.595333 | 4.739273 | 4.789812 |
| std | 255.263629 | 17.881487 | 1.705293 | 87.779889 | 0.454732 | 0.388242 |
| min | 12.0 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 87.0 | 94.000000 | 1.000000 | 1.000000 | 4.687500 | 4.750000 |
| 50% | 135.0 | 98.000000 | 2.000000 | 14.000000 | 4.860000 | 4.900000 |
| 75% | 214.0 | 100.000000 | 3.000000 | 54.000000 | 4.980000 | 5.000000 |
| max | 5000.0 | 100.000000 | 24.000000 | 3332.000000 | 5.000000 | 5.000000 |
While doing the first walkthrough of my data, I was perplexed that there was a couple of places that were actually renting as low as 12 dollars a night. This signifys that there are actually a good amount of places where a person can get a good deal for so low. The highest priced places were either large mansion sized houses or high-rises which had a large amount of beds, types of places that I did not even know could be rented out this way through AirBnB.
In the data for the city that I chose, the variable bathrooms is odd. Is this the case for the city you chose? Look at the data. What should you do with this variable? Do it. After doing that, turn to compute the summary statistics of the data and describe them!
Make sure you provide an interesting discussion of your summary statistics. If you only provide a table without any commentary/discussion, this exercise is useless. Remember, you are analyzing the data and you must explain it to a third person who has never seen these data (e.g., your boss) and you want to help this person to learn from the data. If you only provide a list of facts, then you are not doing your job as the analyst.
Specifically, create a correlation matrix plot for review_scores_accuracy, review_scores_communication, review_scores_location,review_scores_value,minimum_nights, maximum_nights
# Correlation matrix
matrix = df[['review_scores_accuracy','review_scores_communication','review_scores_location','review_scores_value','minimum_nights','maximum_nights']]
matrix.corr()
| review_scores_accuracy | review_scores_communication | review_scores_location | review_scores_value | minimum_nights | maximum_nights | |
|---|---|---|---|---|---|---|
| review_scores_accuracy | 1.000000 | 0.717416 | 0.513477 | 0.789684 | -0.022795 | -0.015459 |
| review_scores_communication | 0.717416 | 1.000000 | 0.405164 | 0.705978 | -0.029796 | -0.044169 |
| review_scores_location | 0.513477 | 0.405164 | 1.000000 | 0.525811 | -0.055962 | 0.006788 |
| review_scores_value | 0.789684 | 0.705978 | 0.525811 | 1.000000 | -0.037684 | -0.027540 |
| minimum_nights | -0.022795 | -0.029796 | -0.055962 | -0.037684 | 1.000000 | 0.131693 |
| maximum_nights | -0.015459 | -0.044169 | 0.006788 | -0.027540 | 0.131693 | 1.000000 |
Look at prices. Do you think the data looks reasonable? What types of properties and rooms are in the data?
Create a new dataframe called, for example, df2 that restricts attention to room_type == Entire home/apt. Once you have that, create another dataframe, called df3 that takes df2 and restricts attention to the following property_types: Entire residential home, Entire rental unit, Entire condominium (condo), Entire guesthouse, Entire serviced apartment, Entire townhouse, and Entire bungalow.
# Let's look at price
price_analysis = df[['price2']]
price_analysis.describe()
| price2 | |
|---|---|
| count | 8528.0 |
| mean | 198.283185 |
| std | 255.263629 |
| min | 12.0 |
| 25% | 87.0 |
| 50% | 135.0 |
| 75% | 214.0 |
| max | 5000.0 |
There are a couple of things that call our attention here. First, what is the minimum price? The 25th percentile? The 99th percentile? The maximum price? Does this suggest that we need to do further cleaning of the data? Why?
To do this, note that we have two variables called room_type and property_type. Let's restrict first to room_type == Entire home/apt.
# Create `df2` as defined above
df2 = pd.DataFrame()
df2 = df[df['room_type'] == 'Entire home/apt']
Let's restrict the analysis, from now on, to properties of the types: Entire residential home, Entire rental unit, Entire condominium (condo), Entire guesthouse, Entire serviced apartment, Entire townhouse, and Entire bungalow.
# Create `df3` as defined above
prop_types = [
'Entire residential home',
'Entire rental unit',
'Entire condominium (condo)',
'Entire guesthouse',
'Entire serviced apartment',
'Entire townhouse',
'Entire bungalow'
]
df3 = df2[df2['property_type'].isin(prop_types)]
Specifically, create df4 that takes df3 and keeps observations that have prices that are above the 5th and below the 95th percentile of the distribution of price in df3. Plot the distribution of prices (remember that price is a continuous variable!). How many listings are left? How many hosts?
# Create `df4` as defined above
price_5th_percentile = df3['price2'].quantile(0.05)
price_95th_percentile = df3['price2'].quantile(0.95)
# Create df4 by filtering df3 based on the specified price range
df4 = df3[(df3['price2'] > price_5th_percentile) & (df3['price2'] < price_95th_percentile)]
# Plot the distribution
plt.figure(figsize=(10, 6))
plt.hist(df4['price2'], bins=30, color='lightgreen', edgecolor='black')
plt.title('Distribution of Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()
How many listings remain? Unique hosts?
# Calculate the number of listings and unique hosts.
listings = len(df4)
hosts = df4['host_id'].nunique()
print(f"Number of listings in df4: {listings}")
print(f"Number of unique hosts in df4: {hosts}")
Number of listings in df4: 4241 Number of unique hosts in df4: 1919
Now that we have a cleaned dataset (though we'll clean it more later...), use folium to plot the locations.
Chicago = folium.Map(location=[41.87466362949316, -87.62469034613902], zoom_start=10.5)
for index, row in df3.iterrows():
folium.Marker([row['latitude'], row['longitude']], tooltip=row['name']).add_to(Chicago)
Chicago
There are fewer hosts than listings. This means that some hosts manage more than one listing.
This dataframe, called df_hosts, should have, for each unique host, the total number of listings that she manages, the average price of the listings she manages, and its super host status. The idea is to check if hosts that manage more than one listing charge higher or lower prices on average and to examine how the average price varies with the number of listings managed by each host (and the same for super host status).
# Create `df_hosts`. A good way to do this is to use `groupby` focusing on the different variables of interest and the operations you want to do with each of them.
#Group by 'host_id' and calculate total number of listings, average price, and super host status
df_hosts = df4.groupby('host_id').agg({
'id': 'count', #Total number of listings
'price2': 'mean', #Average price
'host_is_superhost': 'max' #Superhost status
}).reset_index()
df_hosts.columns = ['host_id', 'total_listings', 'average_price', 'superhost_status']
df_hosts
| host_id | total_listings | average_price | superhost_status | |
|---|---|---|---|---|
| 0 | 2153 | 10 | 101.9 | f |
| 1 | 7529 | 1 | 135.0 | t |
| 2 | 17928 | 1 | 92.0 | t |
| 3 | 31673 | 1 | 141.0 | t |
| 4 | 33004 | 4 | 155.5 | t |
| ... | ... | ... | ... | ... |
| 1914 | 531889983 | 1 | 265.0 | f |
| 1915 | 532571501 | 1 | 261.0 | f |
| 1916 | 533116573 | 1 | 106.0 | f |
| 1917 | 533550346 | 2 | 148.5 | f |
| 1918 | 535521512 | 1 | 150.0 | f |
1919 rows × 4 columns
# Make a scatterplot (`sns.scatterplot`) that has the number of listings per host on the horizontal axis and the mean price charged by that host on the vertical axis.
plt.figure(figsize=(12, 8))
sns.scatterplot(x='total_listings', y='average_price', data=df_hosts)
plt.title('Number of Listings per Host vs Mean Price')
plt.xlabel('Number of Listings per Host')
plt.ylabel('Mean Price')
plt.show()
It is clear that there are few hosts with many listings. Let's identify these and drop them, and their listings, from the data. Before doing this, however, let's see what type of listings belong to these hosts.
Take df4 and create a new dataframe, df5 that drops hosts with more than 50 listings (and those listings as well).
# Identify the hosts that have more than 50 listings (use logical indexing to do this)
more_than50 = df4.groupby('host_id').agg({
'id': 'count' #Total number of listings
}).reset_index()
more_than50.columns = ['host_id', 'total_listings']
to_drop = more_than50[more_than50['total_listings'] > 50]
to_drop
| host_id | total_listings | |
|---|---|---|
| 123 | 3965428 | 73 |
| 608 | 47172572 | 57 |
| 925 | 107434423 | 645 |
| 1581 | 395498979 | 61 |
top_hosts = df4[df4['host_id'].isin(to_drop['host_id'])]
top_hosts
| id | name | host_id | host_since | host_location | host_acceptance_rate | host_is_superhost | host_neighbourhood | neighbourhood_cleansed | latitude | ... | maximum_nights | number_of_reviews | review_scores_rating | review_scores_accuracy | review_scores_communication | review_scores_location | review_scores_value | price2 | host_acceptance_rate2 | host_is_superhost2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 59 | 1083987 | Rental unit in Chicago · ★5.0 · 1 bedroom · 2 ... | 3965428 | 2012-10-24 | Chicago, IL | 69% | t | South Loop/Printers Row | Near South Side | 41.853950 | ... | 1123 | 23 | 5.00 | 5.00 | 5.00 | 4.81 | 4.69 | 120 | 69.0 | True |
| 82 | 1732652 | Rental unit in Chicago · ★4.67 · 1 bedroom · 1... | 3965428 | 2012-10-24 | Chicago, IL | 69% | t | South Loop/Printers Row | Near North Side | 41.892690 | ... | 365 | 7 | 4.67 | 4.40 | 4.67 | 5.00 | 4.50 | 153 | 69.0 | True |
| 95 | 1944802 | Rental unit in Chicago · ★4.69 · 2 bedrooms · ... | 3965428 | 2012-10-24 | Chicago, IL | 69% | t | South Loop/Printers Row | Near South Side | 41.854870 | ... | 1125 | 14 | 4.69 | 5.00 | 5.00 | 4.46 | 4.62 | 200 | 69.0 | True |
| 112 | 2583929 | Rental unit in Chicago · ★4.92 · 1 bedroom · 2... | 3965428 | 2012-10-24 | Chicago, IL | 69% | t | South Loop/Printers Row | Near South Side | 41.852260 | ... | 1122 | 16 | 4.92 | 4.92 | 5.00 | 4.69 | 4.77 | 145 | 69.0 | True |
| 113 | 2661465 | Rental unit in Chicago · ★4.71 · 1 bedroom · 1... | 3965428 | 2012-10-24 | Chicago, IL | 69% | t | South Loop/Printers Row | Loop | 41.887650 | ... | 365 | 25 | 4.71 | 4.71 | 4.71 | 4.92 | 4.63 | 160 | 69.0 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8507 | 975295012005513617 | Rental unit in Chicago · ★New · 2 bedrooms · 2... | 107434423 | 2016-12-16 | New York, NY | 97% | f | Cambridge | West Town | 41.894776 | ... | 1125 | 0 | NaN | NaN | NaN | NaN | NaN | 102 | 97.0 | False |
| 8508 | 975295340703444048 | Rental unit in Chicago · ★New · 2 bedrooms · 2... | 107434423 | 2016-12-16 | New York, NY | 97% | f | Cambridge | West Town | 41.894776 | ... | 1125 | 0 | NaN | NaN | NaN | NaN | NaN | 121 | 97.0 | False |
| 8513 | 975802849898108433 | Rental unit in Chicago · ★New · 2 bedrooms · 2... | 107434423 | 2016-12-16 | New York, NY | 97% | f | Cambridge | Lake View | 41.940411 | ... | 1125 | 0 | NaN | NaN | NaN | NaN | NaN | 229 | 97.0 | False |
| 8514 | 975806375396218857 | Rental unit in Chicago · ★New · 1 bedroom · 1 ... | 107434423 | 2016-12-16 | New York, NY | 97% | f | Cambridge | Loop | 41.869480 | ... | 1125 | 0 | NaN | NaN | NaN | NaN | NaN | 148 | 97.0 | False |
| 8526 | 977980712187801208 | Rental unit in Chicago · ★New · 1 bedroom · 1 ... | 107434423 | 2016-12-16 | New York, NY | 97% | f | Cambridge | Loop | 41.869479 | ... | 1125 | 0 | NaN | NaN | NaN | NaN | NaN | 127 | 97.0 | False |
836 rows × 28 columns
df4
| id | name | host_id | host_since | host_location | host_acceptance_rate | host_is_superhost | host_neighbourhood | neighbourhood_cleansed | latitude | ... | maximum_nights | number_of_reviews | review_scores_rating | review_scores_accuracy | review_scores_communication | review_scores_location | review_scores_value | price2 | host_acceptance_rate2 | host_is_superhost2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7126 | Rental unit in Chicago · ★4.70 · 1 bedroom · 1... | 17928 | 2009-05-19 | Chicago, IL | 95% | t | Ukrainian Village | West Town | 41.901660 | ... | 60 | 505 | 4.70 | 4.84 | 4.87 | 4.89 | 4.75 | 92 | 95.0 | True |
| 2 | 10945 | Rental unit in Chicago · ★4.65 · 2 bedrooms · ... | 33004 | 2009-08-21 | Chicago, IL | 95% | t | Old Town | Lincoln Park | 41.911960 | ... | 1125 | 72 | 4.65 | 4.76 | 4.82 | 4.99 | 4.65 | 170 | 95.0 | True |
| 5 | 25879 | Rental unit in Chicago · ★4.30 · 2 bedrooms · ... | 101521 | 2010-03-31 | Chicago, IL | 90% | t | Lincoln Park | Lincoln Park | 41.924990 | ... | 300 | 51 | 4.30 | 4.57 | 4.13 | 4.92 | 4.35 | 95 | 90.0 | True |
| 10 | 94450 | Rental unit in Chicago · ★5.0 · 1 bedroom · 1 ... | 504470 | 2011-04-12 | Chicago, IL | 92% | t | Andersonville | Edgewater | 41.979600 | ... | 30 | 17 | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 | 105 | 92.0 | True |
| 11 | 145659 | Rental unit in Chicago · ★4.78 · 3 bedrooms · ... | 683529 | 2011-06-09 | Chicago, IL | 99% | f | Roscoe Village | North Center | 41.943420 | ... | 1125 | 72 | 4.78 | 4.81 | 4.61 | 4.89 | 4.73 | 204 | 99.0 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8517 | 976106807063239067 | Serviced apartment in Chicago · ★New · 1 bedro... | 177188450 | 2018-03-07 | Chicago, IL | 90% | f | River North | Near North Side | 41.891832 | ... | 365 | 0 | NaN | NaN | NaN | NaN | NaN | 248 | 90.0 | False |
| 8521 | 976881098303922237 | Rental unit in Chicago · ★New · 1 bedroom · 1 ... | 441260321 | 2022-01-19 | Chicago, IL | 100% | t | Downtown Los Angeles | Loop | 41.873189 | ... | 365 | 0 | NaN | NaN | NaN | NaN | NaN | 150 | 100.0 | True |
| 8522 | 976914878462533348 | Rental unit in Chicago · ★New · 2 bedrooms · 2... | 97961430 | 2016-10-04 | Chicago, IL | NaN | f | West Loop | Near West Side | 41.887230 | ... | 122 | 0 | NaN | NaN | NaN | NaN | NaN | 176 | NaN | False |
| 8526 | 977980712187801208 | Rental unit in Chicago · ★New · 1 bedroom · 1 ... | 107434423 | 2016-12-16 | New York, NY | 97% | f | Cambridge | Loop | 41.869479 | ... | 1125 | 0 | NaN | NaN | NaN | NaN | NaN | 127 | 97.0 | False |
| 8527 | 978264935480956783 | Rental unit in Chicago · ★New · 3 bedrooms · 4... | 518917937 | 2023-06-08 | Chicago, IL | 99% | f | West Town | Lake View | 41.948470 | ... | 365 | 0 | NaN | NaN | NaN | NaN | NaN | 366 | 99.0 | False |
4241 rows × 28 columns
df5 = df4[~df4['host_id'].isin(top_hosts['host_id'])]
df5
| id | name | host_id | host_since | host_location | host_acceptance_rate | host_is_superhost | host_neighbourhood | neighbourhood_cleansed | latitude | ... | maximum_nights | number_of_reviews | review_scores_rating | review_scores_accuracy | review_scores_communication | review_scores_location | review_scores_value | price2 | host_acceptance_rate2 | host_is_superhost2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7126 | Rental unit in Chicago · ★4.70 · 1 bedroom · 1... | 17928 | 2009-05-19 | Chicago, IL | 95% | t | Ukrainian Village | West Town | 41.901660 | ... | 60 | 505 | 4.70 | 4.84 | 4.87 | 4.89 | 4.75 | 92 | 95.0 | True |
| 2 | 10945 | Rental unit in Chicago · ★4.65 · 2 bedrooms · ... | 33004 | 2009-08-21 | Chicago, IL | 95% | t | Old Town | Lincoln Park | 41.911960 | ... | 1125 | 72 | 4.65 | 4.76 | 4.82 | 4.99 | 4.65 | 170 | 95.0 | True |
| 5 | 25879 | Rental unit in Chicago · ★4.30 · 2 bedrooms · ... | 101521 | 2010-03-31 | Chicago, IL | 90% | t | Lincoln Park | Lincoln Park | 41.924990 | ... | 300 | 51 | 4.30 | 4.57 | 4.13 | 4.92 | 4.35 | 95 | 90.0 | True |
| 10 | 94450 | Rental unit in Chicago · ★5.0 · 1 bedroom · 1 ... | 504470 | 2011-04-12 | Chicago, IL | 92% | t | Andersonville | Edgewater | 41.979600 | ... | 30 | 17 | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 | 105 | 92.0 | True |
| 11 | 145659 | Rental unit in Chicago · ★4.78 · 3 bedrooms · ... | 683529 | 2011-06-09 | Chicago, IL | 99% | f | Roscoe Village | North Center | 41.943420 | ... | 1125 | 72 | 4.78 | 4.81 | 4.61 | 4.89 | 4.73 | 204 | 99.0 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8511 | 975407160632841105 | Rental unit in Chicago · ★New · 2 bedrooms · 2... | 415825853 | 2021-07-29 | Calumet Park, IL | 96% | t | Near South Side | Near South Side | 41.863088 | ... | 365 | 0 | NaN | NaN | NaN | NaN | NaN | 265 | 96.0 | True |
| 8517 | 976106807063239067 | Serviced apartment in Chicago · ★New · 1 bedro... | 177188450 | 2018-03-07 | Chicago, IL | 90% | f | River North | Near North Side | 41.891832 | ... | 365 | 0 | NaN | NaN | NaN | NaN | NaN | 248 | 90.0 | False |
| 8521 | 976881098303922237 | Rental unit in Chicago · ★New · 1 bedroom · 1 ... | 441260321 | 2022-01-19 | Chicago, IL | 100% | t | Downtown Los Angeles | Loop | 41.873189 | ... | 365 | 0 | NaN | NaN | NaN | NaN | NaN | 150 | 100.0 | True |
| 8522 | 976914878462533348 | Rental unit in Chicago · ★New · 2 bedrooms · 2... | 97961430 | 2016-10-04 | Chicago, IL | NaN | f | West Loop | Near West Side | 41.887230 | ... | 122 | 0 | NaN | NaN | NaN | NaN | NaN | 176 | NaN | False |
| 8527 | 978264935480956783 | Rental unit in Chicago · ★New · 3 bedrooms · 4... | 518917937 | 2023-06-08 | Chicago, IL | 99% | f | West Town | Lake View | 41.948470 | ... | 365 | 0 | NaN | NaN | NaN | NaN | NaN | 366 | 99.0 | False |
3405 rows × 28 columns
df5¶# Merge the information as described above
small_hosts = df5.groupby('host_id').agg({
'id': 'count', #Total number of listings
'price2': 'mean', #Average price
'host_is_superhost': 'max' #Superhost status
}).reset_index()
small_hosts.columns = ['host_id', 'total_listings', 'average_price', 'superhost_status']
merged_df = pd.merge(df5, small_hosts, on='host_id', how='left')
merged_df
| id | name | host_id | host_since | host_location | host_acceptance_rate | host_is_superhost | host_neighbourhood | neighbourhood_cleansed | latitude | ... | review_scores_accuracy | review_scores_communication | review_scores_location | review_scores_value | price2 | host_acceptance_rate2 | host_is_superhost2 | total_listings | average_price | superhost_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7126 | Rental unit in Chicago · ★4.70 · 1 bedroom · 1... | 17928 | 2009-05-19 | Chicago, IL | 95% | t | Ukrainian Village | West Town | 41.901660 | ... | 4.84 | 4.87 | 4.89 | 4.75 | 92 | 95.0 | True | 1 | 92.0 | t |
| 1 | 10945 | Rental unit in Chicago · ★4.65 · 2 bedrooms · ... | 33004 | 2009-08-21 | Chicago, IL | 95% | t | Old Town | Lincoln Park | 41.911960 | ... | 4.76 | 4.82 | 4.99 | 4.65 | 170 | 95.0 | True | 4 | 155.5 | t |
| 2 | 25879 | Rental unit in Chicago · ★4.30 · 2 bedrooms · ... | 101521 | 2010-03-31 | Chicago, IL | 90% | t | Lincoln Park | Lincoln Park | 41.924990 | ... | 4.57 | 4.13 | 4.92 | 4.35 | 95 | 90.0 | True | 3 | 130.333333 | t |
| 3 | 94450 | Rental unit in Chicago · ★5.0 · 1 bedroom · 1 ... | 504470 | 2011-04-12 | Chicago, IL | 92% | t | Andersonville | Edgewater | 41.979600 | ... | 5.00 | 5.00 | 5.00 | 5.00 | 105 | 92.0 | True | 1 | 105.0 | t |
| 4 | 145659 | Rental unit in Chicago · ★4.78 · 3 bedrooms · ... | 683529 | 2011-06-09 | Chicago, IL | 99% | f | Roscoe Village | North Center | 41.943420 | ... | 4.81 | 4.61 | 4.89 | 4.73 | 204 | 99.0 | False | 20 | 244.5 | f |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3400 | 975407160632841105 | Rental unit in Chicago · ★New · 2 bedrooms · 2... | 415825853 | 2021-07-29 | Calumet Park, IL | 96% | t | Near South Side | Near South Side | 41.863088 | ... | NaN | NaN | NaN | NaN | 265 | 96.0 | True | 1 | 265.0 | t |
| 3401 | 976106807063239067 | Serviced apartment in Chicago · ★New · 1 bedro... | 177188450 | 2018-03-07 | Chicago, IL | 90% | f | River North | Near North Side | 41.891832 | ... | NaN | NaN | NaN | NaN | 248 | 90.0 | False | 16 | 291.6875 | f |
| 3402 | 976881098303922237 | Rental unit in Chicago · ★New · 1 bedroom · 1 ... | 441260321 | 2022-01-19 | Chicago, IL | 100% | t | Downtown Los Angeles | Loop | 41.873189 | ... | NaN | NaN | NaN | NaN | 150 | 100.0 | True | 1 | 150.0 | t |
| 3403 | 976914878462533348 | Rental unit in Chicago · ★New · 2 bedrooms · 2... | 97961430 | 2016-10-04 | Chicago, IL | NaN | f | West Loop | Near West Side | 41.887230 | ... | NaN | NaN | NaN | NaN | 176 | NaN | False | 1 | 176.0 | f |
| 3404 | 978264935480956783 | Rental unit in Chicago · ★New · 3 bedrooms · 4... | 518917937 | 2023-06-08 | Chicago, IL | 99% | f | West Town | Lake View | 41.948470 | ... | NaN | NaN | NaN | NaN | 366 | 99.0 | False | 26 | 147.576923 | f |
3405 rows × 31 columns
In this section, we'll use statsmodels, patsy, and linearmodels as we did in the ML handout to examine how listing and host attributes are related to listing price. Note that some variables have many missing observations so while we won't drop them explicitly, linearmodels won't consider those observations in the analysis.
Define the dataframe panel by taking df5 and restricting to df5[['host_id','bedrooms', 'beds','number_of_reviews','review_scores_rating','price2','host_is_superhost2', 'n_listings']].
You will also need to define df5['time']=1 to serve as a time fixed effect (that we won't use) and set panel = panel.set_index(['host_id','time']) as the indexes to be used in estimation. Then proceed to add a constant to panel when defining the exogenous variables that go into the regression (we did this in the ML handout). Define the dependent variable to be the natural logarithm of the price.
Choose five locations in the city that you chose, that you consider would be of interest to a tourist. Make sure these locations are not near each other but rather distant from each other (this will result in generating clusters of related listings). Then, compute the linear distance (i.e., Haversine) from each listing to the five locations of interest and add this information to df5. Do not compute driving distances.
df5 = merged_df
df5['time']=1
panel = df5[['host_id','bedrooms', 'beds','number_of_reviews','review_scores_rating','price2','host_is_superhost2','total_listings']]
panel = df5.set_index(['host_id','time'])
from shapely.geometry import Polygon, Point, LineString
from pyproj import CRS
lat_point_list = [41.64098624147555, 41.612691128743016, 42.27230259745743, 42.18361979589204]
lon_point_list = [-88.03606370170382,-87.31309668383324,-87.71827343224777,-88.23539863535352]
polygon_geom = Polygon(zip(lon_point_list, lat_point_list))
crs = CRS('epsg:4326')
polygon = gpd.GeoDataFrame(index=[0], crs=crs, geometry=[polygon_geom])
polygon['name']='Bounding_box'
print(polygon.geometry)
0 POLYGON ((-88.03606 41.64099, -87.31310 41.612... Name: geometry, dtype: geometry
polygon.crs
<Geographic 2D CRS: EPSG:4326> Name: WGS 84 Axis Info [ellipsoidal]: - Lat[north]: Geodetic latitude (degree) - Lon[east]: Geodetic longitude (degree) Area of Use: - name: World. - bounds: (-180.0, -90.0, 180.0, 90.0) Datum: World Geodetic System 1984 ensemble - Ellipsoid: WGS 84 - Prime Meridian: Greenwich
tourists_dict = {
'latitude': [41.88266066881203, 41.92163043325434, 42.14395570837075, 41.88221147788901, 41.79240081830397],
'longitude': [-87.62334364343683, -87.6340144787165, -87.79010853741309, -87.71624427481888, -87.58362020107282]
}
tourist_gdf = gpd.GeoDataFrame(
{'Location': ['the_bean', 'lincoln_park_zoo', 'botanic_garden', 'garfield_conservatory', 'science_museum'],
'geometry': [Point(lon, lat) for lat, lon in zip(tourists_dict['latitude'], tourists_dict['longitude'])]},
crs="EPSG:4326"
)
tourist_gdf = gpd.GeoDataFrame(
{'Location': ['the_bean', 'lincoln_park_zoo', 'botanic_garden', 'garfield_conservatory', 'science_museum'],
'geometry': [Point(lat, lon) for lat, lon in zip(tourists_dict['latitude'], tourists_dict['longitude'])]},
crs="EPSG:4326"
)
tourist_gdf = tourist_gdf.set_index('Location')
df5['the_bean'] = tourist_gdf.loc['the_bean', 'geometry']
df5['lincoln_park_zoo'] = tourist_gdf.loc['lincoln_park_zoo', 'geometry']
df5['botanic_garden'] = tourist_gdf.loc['botanic_garden', 'geometry']
df5['garfield_conservatory'] = tourist_gdf.loc['garfield_conservatory', 'geometry']
df5['science_museum'] = tourist_gdf.loc['science_museum', 'geometry']
geometry_points = [Point(lat, lon) for lat, lon in zip(df5['latitude'], df5['longitude'])]
df5['point'] = geometry_points
df5 = gpd.GeoDataFrame(df5, geometry='point', crs="EPSG:4326")
df5
| id | name | host_id | host_since | host_location | host_acceptance_rate | host_is_superhost | host_neighbourhood | neighbourhood_cleansed | latitude | ... | total_listings | average_price | superhost_status | time | the_bean | lincoln_park_zoo | botanic_garden | garfield_conservatory | science_museum | point | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7126 | Rental unit in Chicago · ★4.70 · 1 bedroom · 1... | 17928 | 2009-05-19 | Chicago, IL | 95% | t | Ukrainian Village | West Town | 41.901660 | ... | 1 | 92.0 | t | 1 | POINT (41.88266066881203 -87.62334364343683) | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.90166 -87.68021) |
| 1 | 10945 | Rental unit in Chicago · ★4.65 · 2 bedrooms · ... | 33004 | 2009-08-21 | Chicago, IL | 95% | t | Old Town | Lincoln Park | 41.911960 | ... | 4 | 155.5 | t | 1 | POINT (41.88266066881203 -87.62334364343683) | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.91196 -87.63981) |
| 2 | 25879 | Rental unit in Chicago · ★4.30 · 2 bedrooms · ... | 101521 | 2010-03-31 | Chicago, IL | 90% | t | Lincoln Park | Lincoln Park | 41.924990 | ... | 3 | 130.333333 | t | 1 | POINT (41.88266066881203 -87.62334364343683) | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.92499 -87.65573) |
| 3 | 94450 | Rental unit in Chicago · ★5.0 · 1 bedroom · 1 ... | 504470 | 2011-04-12 | Chicago, IL | 92% | t | Andersonville | Edgewater | 41.979600 | ... | 1 | 105.0 | t | 1 | POINT (41.88266066881203 -87.62334364343683) | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.97960 -87.66512) |
| 4 | 145659 | Rental unit in Chicago · ★4.78 · 3 bedrooms · ... | 683529 | 2011-06-09 | Chicago, IL | 99% | f | Roscoe Village | North Center | 41.943420 | ... | 20 | 244.5 | f | 1 | POINT (41.88266066881203 -87.62334364343683) | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.94342 -87.68121) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3400 | 975407160632841105 | Rental unit in Chicago · ★New · 2 bedrooms · 2... | 415825853 | 2021-07-29 | Calumet Park, IL | 96% | t | Near South Side | Near South Side | 41.863088 | ... | 1 | 265.0 | t | 1 | POINT (41.88266066881203 -87.62334364343683) | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.86309 -87.62668) |
| 3401 | 976106807063239067 | Serviced apartment in Chicago · ★New · 1 bedro... | 177188450 | 2018-03-07 | Chicago, IL | 90% | f | River North | Near North Side | 41.891832 | ... | 16 | 291.6875 | f | 1 | POINT (41.88266066881203 -87.62334364343683) | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.89183 -87.63820) |
| 3402 | 976881098303922237 | Rental unit in Chicago · ★New · 1 bedroom · 1 ... | 441260321 | 2022-01-19 | Chicago, IL | 100% | t | Downtown Los Angeles | Loop | 41.873189 | ... | 1 | 150.0 | t | 1 | POINT (41.88266066881203 -87.62334364343683) | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.87319 -87.63408) |
| 3403 | 976914878462533348 | Rental unit in Chicago · ★New · 2 bedrooms · 2... | 97961430 | 2016-10-04 | Chicago, IL | NaN | f | West Loop | Near West Side | 41.887230 | ... | 1 | 176.0 | f | 1 | POINT (41.88266066881203 -87.62334364343683) | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.88723 -87.64413) |
| 3404 | 978264935480956783 | Rental unit in Chicago · ★New · 3 bedrooms · 4... | 518917937 | 2023-06-08 | Chicago, IL | 99% | f | West Town | Lake View | 41.948470 | ... | 26 | 147.576923 | f | 1 | POINT (41.88266066881203 -87.62334364343683) | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.94847 -87.65723) |
3405 rows × 38 columns
# Write down the function that you will use to loop over all listings and compute the distance to the points of interest.
# You need to pass a dataframe as input (for example, `df5`) and get back a dataframe with all the information of the
# original dataframe plus two columns with distance to the locations of interest. Call the output dataframe `df6`
from math import radians, sin, cos, atan2, sqrt
df6 = df5
def haversine_distance(coord1, coord2):
# Haversine formula
lat1, lon1 = coord1.y, coord1.x
lat2, lon2 = coord2.y, coord2.x
# Convert latitude and longitude from degrees to radians
lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
# Haversine formula
dlat = lat2 - lat1
dlon = lon2 - lon1
a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
c = 2 * atan2(sqrt(a), sqrt(1 - a))
radius_of_earth = 6371000 # Radius of Earth in meters
distance = radius_of_earth * c
return distance
df6['distance_to_the_bean'] = df5.apply(lambda row: haversine_distance(row['point'], row['the_bean']), axis=1)
df6['distance_to_lincoln_park_zoo'] = df5.apply(lambda row: haversine_distance(row['point'], row['lincoln_park_zoo']), axis=1)
df6['distance_to_botanic_garden'] = df5.apply(lambda row: haversine_distance(row['point'], row['botanic_garden']), axis=1)
df6['distance_to_garfield_conservatory'] = df5.apply(lambda row: haversine_distance(row['point'], row['garfield_conservatory']), axis=1)
df6['distance_to_science_museum'] = df5.apply(lambda row: haversine_distance(row['point'], row['science_museum']), axis=1)
df6
| id | name | host_id | host_since | host_location | host_acceptance_rate | host_is_superhost | host_neighbourhood | neighbourhood_cleansed | latitude | ... | lincoln_park_zoo | botanic_garden | garfield_conservatory | science_museum | point | distance_to_the_bean | distance_to_lincoln_park_zoo | distance_to_botanic_garden | distance_to_garfield_conservatory | distance_to_science_museum | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7126 | Rental unit in Chicago · ★4.70 · 1 bedroom · 1... | 17928 | 2009-05-19 | Chicago, IL | 95% | t | Ukrainian Village | West Town | 41.901660 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.90166 -87.68021) | 6323.842702 | 5137.509592 | 12266.427808 | 4007.769743 | 10752.015508 |
| 1 | 10945 | Rental unit in Chicago · ★4.65 · 2 bedrooms · ... | 33004 | 2009-08-21 | Chicago, IL | 95% | t | Old Town | Lincoln Park | 41.911960 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.91196 -87.63981) | 1835.918505 | 645.955946 | 16744.020648 | 8500.159874 | 6272.529685 |
| 2 | 25879 | Rental unit in Chicago · ★4.30 · 2 bedrooms · ... | 101521 | 2010-03-31 | Chicago, IL | 90% | t | Lincoln Park | Lincoln Park | 41.924990 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.92499 -87.65573) | 3606.412223 | 2414.704591 | 14973.467194 | 6731.620259 | 8041.584980 |
| 3 | 94450 | Rental unit in Chicago · ★5.0 · 1 bedroom · 1 ... | 504470 | 2011-04-12 | Chicago, IL | 92% | t | Andersonville | Edgewater | 41.979600 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.97960 -87.66512) | 4666.399409 | 3468.863401 | 13916.954743 | 5701.480087 | 9103.333700 |
| 4 | 145659 | Rental unit in Chicago · ★4.78 · 3 bedrooms · ... | 683529 | 2011-06-09 | Chicago, IL | 99% | f | Roscoe Village | North Center | 41.943420 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.94342 -87.68121) | 6440.393583 | 5248.836639 | 12140.954107 | 3905.207423 | 10873.631977 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3400 | 975407160632841105 | Rental unit in Chicago · ★New · 2 bedrooms · 2... | 415825853 | 2021-07-29 | Calumet Park, IL | 96% | t | Near South Side | Near South Side | 41.863088 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.86309 -87.62668) | 381.965796 | 858.650326 | 18215.047133 | 9959.287607 | 4799.461418 |
| 3401 | 976106807063239067 | Serviced apartment in Chicago · ★New · 1 bedro... | 177188450 | 2018-03-07 | Chicago, IL | 90% | f | River North | Near North Side | 41.891832 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.89183 -87.63820) | 1652.200318 | 484.782145 | 16928.675828 | 8678.524792 | 6086.181005 |
| 3402 | 976881098303922237 | Rental unit in Chicago · ★New · 1 bedroom · 1 ... | 441260321 | 2022-01-19 | Chicago, IL | 100% | t | Downtown Los Angeles | Loop | 41.873189 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.87319 -87.63408) | 1195.067467 | 222.496468 | 17390.673562 | 9135.896752 | 5623.819314 |
| 3403 | 976914878462533348 | Rental unit in Chicago · ★New · 2 bedrooms · 2... | 97961430 | 2016-10-04 | Chicago, IL | NaN | f | West Loop | Near West Side | 41.887230 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.88723 -87.64413) | 2311.432585 | 1135.778420 | 16271.812047 | 8018.773304 | 6742.687130 |
| 3404 | 978264935480956783 | Rental unit in Chicago · ★New · 3 bedrooms · 4... | 518917937 | 2023-06-08 | Chicago, IL | 99% | f | West Town | Lake View | 41.948470 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.94847 -87.65723) | 3780.016837 | 2584.357797 | 14800.601024 | 6568.821704 | 8216.681882 |
3405 rows × 43 columns
folium map the location of the listings in df5 as well as of the locations of interest. Use one color for all the listings and a different color for the two locations of interest.¶from folium.plugins import MarkerCluster
df6 = pd.DataFrame(df5)
map_center = [df5['latitude'].mean(), df5['longitude'].mean()]
mymap = folium.Map(location=map_center, zoom_start=12)
marker_cluster = MarkerCluster().add_to(mymap)
for index, row in df5.iterrows():
folium.Marker([row['latitude'], row['longitude']], popup=index, icon=folium.Icon(color='blue')).add_to(marker_cluster)
locations_of_interest = {
'the_bean': {'latitude': 41.88266066881203, 'longitude': -87.62334364343683},
'lincoln_park_zoo': {'latitude': 41.92163043325434, 'longitude': -87.6340144787165},
'botanic_garden': {'latitude': 42.14395570837075, 'longitude': -87.79010853741309},
'garfield_conservatory': {'latitude': 41.88221147788901, 'longitude': -87.71624427481888},
'science_museum': {'latitude': 41.79240081830397, 'longitude': -87.58362020107282},
}
for location, coords in locations_of_interest.items():
folium.Marker([coords['latitude'], coords['longitude']], popup=location, icon=folium.Icon(color='red')).add_to(mymap)
mymap
import statsmodels.api as sm
from patsy import dmatrices
from linearmodels.panel import PanelOLS
df5
| id | name | host_id | host_since | host_location | host_acceptance_rate | host_is_superhost | host_neighbourhood | neighbourhood_cleansed | latitude | ... | lincoln_park_zoo | botanic_garden | garfield_conservatory | science_museum | point | distance_to_the_bean | distance_to_lincoln_park_zoo | distance_to_botanic_garden | distance_to_garfield_conservatory | distance_to_science_museum | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7126 | Rental unit in Chicago · ★4.70 · 1 bedroom · 1... | 17928 | 2009-05-19 | Chicago, IL | 95% | t | Ukrainian Village | West Town | 41.901660 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.90166 -87.68021) | 6323.842702 | 5137.509592 | 12266.427808 | 4007.769743 | 10752.015508 |
| 1 | 10945 | Rental unit in Chicago · ★4.65 · 2 bedrooms · ... | 33004 | 2009-08-21 | Chicago, IL | 95% | t | Old Town | Lincoln Park | 41.911960 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.91196 -87.63981) | 1835.918505 | 645.955946 | 16744.020648 | 8500.159874 | 6272.529685 |
| 2 | 25879 | Rental unit in Chicago · ★4.30 · 2 bedrooms · ... | 101521 | 2010-03-31 | Chicago, IL | 90% | t | Lincoln Park | Lincoln Park | 41.924990 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.92499 -87.65573) | 3606.412223 | 2414.704591 | 14973.467194 | 6731.620259 | 8041.584980 |
| 3 | 94450 | Rental unit in Chicago · ★5.0 · 1 bedroom · 1 ... | 504470 | 2011-04-12 | Chicago, IL | 92% | t | Andersonville | Edgewater | 41.979600 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.97960 -87.66512) | 4666.399409 | 3468.863401 | 13916.954743 | 5701.480087 | 9103.333700 |
| 4 | 145659 | Rental unit in Chicago · ★4.78 · 3 bedrooms · ... | 683529 | 2011-06-09 | Chicago, IL | 99% | f | Roscoe Village | North Center | 41.943420 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.94342 -87.68121) | 6440.393583 | 5248.836639 | 12140.954107 | 3905.207423 | 10873.631977 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3400 | 975407160632841105 | Rental unit in Chicago · ★New · 2 bedrooms · 2... | 415825853 | 2021-07-29 | Calumet Park, IL | 96% | t | Near South Side | Near South Side | 41.863088 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.86309 -87.62668) | 381.965796 | 858.650326 | 18215.047133 | 9959.287607 | 4799.461418 |
| 3401 | 976106807063239067 | Serviced apartment in Chicago · ★New · 1 bedro... | 177188450 | 2018-03-07 | Chicago, IL | 90% | f | River North | Near North Side | 41.891832 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.89183 -87.63820) | 1652.200318 | 484.782145 | 16928.675828 | 8678.524792 | 6086.181005 |
| 3402 | 976881098303922237 | Rental unit in Chicago · ★New · 1 bedroom · 1 ... | 441260321 | 2022-01-19 | Chicago, IL | 100% | t | Downtown Los Angeles | Loop | 41.873189 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.87319 -87.63408) | 1195.067467 | 222.496468 | 17390.673562 | 9135.896752 | 5623.819314 |
| 3403 | 976914878462533348 | Rental unit in Chicago · ★New · 2 bedrooms · 2... | 97961430 | 2016-10-04 | Chicago, IL | NaN | f | West Loop | Near West Side | 41.887230 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.88723 -87.64413) | 2311.432585 | 1135.778420 | 16271.812047 | 8018.773304 | 6742.687130 |
| 3404 | 978264935480956783 | Rental unit in Chicago · ★New · 3 bedrooms · 4... | 518917937 | 2023-06-08 | Chicago, IL | 99% | f | West Town | Lake View | 41.948470 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.94847 -87.65723) | 3780.016837 | 2584.357797 | 14800.601024 | 6568.821704 | 8216.681882 |
3405 rows × 43 columns
# Do all the data manipulation as described above and in the ML module
#My dependent variable is price2
panel = df5[['host_id', 'bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating', 'price2', 'host_is_superhost2', 'total_listings','time']]
panel = panel.set_index(['host_id', 'time'])
panel['constant'] = 1
panel = panel.dropna()
dependent = panel.price2
exog = sm.add_constant(panel[['bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating', 'price2', 'host_is_superhost2', 'total_listings']])
panel.dtypes
bedrooms float64 beds float64 number_of_reviews int64 review_scores_rating float64 price2 Int64 host_is_superhost2 bool total_listings int64 constant int64 dtype: object
panel
| bedrooms | beds | number_of_reviews | review_scores_rating | price2 | host_is_superhost2 | total_listings | constant | ||
|---|---|---|---|---|---|---|---|---|---|
| host_id | time | ||||||||
| 17928 | 1 | 1.0 | 1.0 | 505 | 4.70 | 92 | True | 1 | 1 |
| 33004 | 1 | 2.0 | 2.0 | 72 | 4.65 | 170 | True | 4 | 1 |
| 101521 | 1 | 2.0 | 3.0 | 51 | 4.30 | 95 | True | 3 | 1 |
| 504470 | 1 | 1.0 | 1.0 | 17 | 5.00 | 105 | True | 1 | 1 |
| 683529 | 1 | 3.0 | 3.0 | 72 | 4.78 | 204 | False | 20 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 518917937 | 1 | 1.0 | 2.0 | 1 | 1.00 | 136 | False | 26 | 1 |
| 243035563 | 1 | 2.0 | 3.0 | 1 | 5.00 | 248 | False | 30 | 1 |
| 66596522 | 1 | 1.0 | 3.0 | 1 | 5.00 | 230 | False | 2 | 1 |
| 235678953 | 1 | 2.0 | 3.0 | 1 | 3.00 | 216 | True | 16 | 1 |
| 67751745 | 1 | 2.0 | 2.0 | 1 | 5.00 | 367 | False | 7 | 1 |
2719 rows × 8 columns
df5
| id | name | host_id | host_since | host_location | host_acceptance_rate | host_is_superhost | host_neighbourhood | neighbourhood_cleansed | latitude | ... | lincoln_park_zoo | botanic_garden | garfield_conservatory | science_museum | point | distance_to_the_bean | distance_to_lincoln_park_zoo | distance_to_botanic_garden | distance_to_garfield_conservatory | distance_to_science_museum | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7126 | Rental unit in Chicago · ★4.70 · 1 bedroom · 1... | 17928 | 2009-05-19 | Chicago, IL | 95% | t | Ukrainian Village | West Town | 41.901660 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.90166 -87.68021) | 6323.842702 | 5137.509592 | 12266.427808 | 4007.769743 | 10752.015508 |
| 1 | 10945 | Rental unit in Chicago · ★4.65 · 2 bedrooms · ... | 33004 | 2009-08-21 | Chicago, IL | 95% | t | Old Town | Lincoln Park | 41.911960 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.91196 -87.63981) | 1835.918505 | 645.955946 | 16744.020648 | 8500.159874 | 6272.529685 |
| 2 | 25879 | Rental unit in Chicago · ★4.30 · 2 bedrooms · ... | 101521 | 2010-03-31 | Chicago, IL | 90% | t | Lincoln Park | Lincoln Park | 41.924990 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.92499 -87.65573) | 3606.412223 | 2414.704591 | 14973.467194 | 6731.620259 | 8041.584980 |
| 3 | 94450 | Rental unit in Chicago · ★5.0 · 1 bedroom · 1 ... | 504470 | 2011-04-12 | Chicago, IL | 92% | t | Andersonville | Edgewater | 41.979600 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.97960 -87.66512) | 4666.399409 | 3468.863401 | 13916.954743 | 5701.480087 | 9103.333700 |
| 4 | 145659 | Rental unit in Chicago · ★4.78 · 3 bedrooms · ... | 683529 | 2011-06-09 | Chicago, IL | 99% | f | Roscoe Village | North Center | 41.943420 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.94342 -87.68121) | 6440.393583 | 5248.836639 | 12140.954107 | 3905.207423 | 10873.631977 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3400 | 975407160632841105 | Rental unit in Chicago · ★New · 2 bedrooms · 2... | 415825853 | 2021-07-29 | Calumet Park, IL | 96% | t | Near South Side | Near South Side | 41.863088 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.86309 -87.62668) | 381.965796 | 858.650326 | 18215.047133 | 9959.287607 | 4799.461418 |
| 3401 | 976106807063239067 | Serviced apartment in Chicago · ★New · 1 bedro... | 177188450 | 2018-03-07 | Chicago, IL | 90% | f | River North | Near North Side | 41.891832 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.89183 -87.63820) | 1652.200318 | 484.782145 | 16928.675828 | 8678.524792 | 6086.181005 |
| 3402 | 976881098303922237 | Rental unit in Chicago · ★New · 1 bedroom · 1 ... | 441260321 | 2022-01-19 | Chicago, IL | 100% | t | Downtown Los Angeles | Loop | 41.873189 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.87319 -87.63408) | 1195.067467 | 222.496468 | 17390.673562 | 9135.896752 | 5623.819314 |
| 3403 | 976914878462533348 | Rental unit in Chicago · ★New · 2 bedrooms · 2... | 97961430 | 2016-10-04 | Chicago, IL | NaN | f | West Loop | Near West Side | 41.887230 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.88723 -87.64413) | 2311.432585 | 1135.778420 | 16271.812047 | 8018.773304 | 6742.687130 |
| 3404 | 978264935480956783 | Rental unit in Chicago · ★New · 3 bedrooms · 4... | 518917937 | 2023-06-08 | Chicago, IL | 99% | f | West Town | Lake View | 41.948470 | ... | POINT (41.92163043325434 -87.6340144787165) | POINT (42.14395570837075 -87.79010853741309) | POINT (41.88221147788901 -87.71624427481888) | POINT (41.79240081830397 -87.58362020107282) | POINT (41.94847 -87.65723) | 3780.016837 | 2584.357797 | 14800.601024 | 6568.821704 | 8216.681882 |
3405 rows × 43 columns
# Estimate the model without fixed effects as we did in the ML module. First do this without the distance variables and then include these and re-estimate. What happened with the estimated coefficients? Did anything happen with the price coefficient?
# Model without distance variables
dependent_variable = panel['price2']
independent_variables = panel[['bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating', 'host_is_superhost2', 'total_listings']]
modFE = PanelOLS(dependent_variable, independent_variables, entity_effects=True, time_effects=False)
resFE = modFE.fit(cov_type='unadjusted')
print(resFE)
--------------------------------------------------------------------------- AbsorbingEffectError Traceback (most recent call last) Cell In[63], line 8 5 independent_variables = panel[['bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating', 'host_is_superhost2', 'total_listings']] 7 modFE = PanelOLS(dependent_variable, independent_variables, entity_effects=True, time_effects=False) ----> 8 resFE = modFE.fit(cov_type='unadjusted') 9 print(resFE) File C:\ProgramData\anaconda3\Lib\site-packages\linearmodels\panel\model.py:1856, in PanelOLS.fit(self, use_lsdv, use_lsmr, low_memory, cov_type, debiased, auto_df, count_effects, **cov_config) 1854 if self.entity_effects or self.time_effects or self.other_effects: 1855 if not self._drop_absorbed: -> 1856 check_absorbed(x, [str(var) for var in self.exog.vars]) 1857 else: 1858 # TODO: Need to special case the constant here when determining which 1859 # to retain since we always want to retain the constant if present 1860 retain = not_absorbed(x, self._constant, self._constant_index) File C:\ProgramData\anaconda3\Lib\site-packages\linearmodels\panel\utility.py:435, in check_absorbed(x, variables, x_orig) 433 absorbed_variables = "\n".join(rows) 434 msg = absorbing_error_msg.format(absorbed_variables=absorbed_variables) --> 435 raise AbsorbingEffectError(msg) 436 if x_orig is None: 437 return AbsorbingEffectError: The model cannot be estimated. The included effects have fully absorbed one or more of the variables. This occurs when one or more of the dependent variable is perfectly explained using the effects included in the model. The following variables or variable combinations have been fully absorbed or have become perfectly collinear after effects are removed: host_is_superhost2 total_listings Set drop_absorbed=True to automatically drop absorbed variables.
That means the two variables are perfectly explained by our dependent variable of price, I'll have to drop them to continue. The beds and and bedrooms will obviously have a positive correlation, it looks like that +1 bed creates an expected value of the price increasing by 12 dollars and +1 bedrooms having an increase of 18 dollars. The p values are suggesting that the data we are finding is also statistically significant as well
dependent_variable = panel['price2']
independent_variables = panel[['bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating']]
modFE = PanelOLS(dependent_variable, independent_variables, time_effects=False)
resFE = modFE.fit(cov_type='unadjusted')
print(resFE)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: price2 R-squared: 0.8431
Estimator: PanelOLS R-squared (Between): 0.8636
No. Observations: 2719 R-squared (Within): 0.0000
Date: Mon, Dec 04 2023 R-squared (Overall): 0.8431
Time: 05:36:23 Log-likelihood -1.555e+04
Cov. Estimator: Unadjusted
F-statistic: 3648.3
Entities: 1669 P-value 0.0000
Avg Obs: 1.6291 Distribution: F(4,2715)
Min Obs: 1.0000
Max Obs: 29.000 F-statistic (robust): 3648.3
P-value 0.0000
Time periods: 1 Distribution: F(4,2715)
Avg Obs: 2719.0
Min Obs: 2719.0
Max Obs: 2719.0
Parameter Estimates
========================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
----------------------------------------------------------------------------------------
bedrooms 18.321 2.5293 7.2435 0.0000 13.362 23.281
beds 12.985 1.7139 7.5761 0.0000 9.6241 16.346
number_of_reviews -0.0689 0.0164 -4.2042 0.0000 -0.1010 -0.0368
review_scores_rating 21.586 0.7593 28.430 0.0000 20.098 23.075
========================================================================================
panel_withdistance = df5[['host_id', 'bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating', 'price2', 'host_is_superhost2', 'total_listings','time', 'distance_to_the_bean','distance_to_lincoln_park_zoo','distance_to_botanic_garden','distance_to_garfield_conservatory','distance_to_science_museum']]
panel_withdistance = panel_withdistance.set_index(['host_id', 'time'])
dependent_variable = panel_withdistance['price2']
independent_variables = panel_withdistance[['bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating', 'distance_to_the_bean','distance_to_lincoln_park_zoo','distance_to_botanic_garden','distance_to_garfield_conservatory','distance_to_science_museum']]
modFE = PanelOLS(dependent_variable, independent_variables,time_effects=False)
resFE = modFE.fit(cov_type='unadjusted')
print(resFE)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: price2 R-squared: 0.8635
Estimator: PanelOLS R-squared (Between): 0.8727
No. Observations: 2719 R-squared (Within): 0.0000
Date: Mon, Dec 04 2023 R-squared (Overall): 0.8635
Time: 05:36:24 Log-likelihood -1.536e+04
Cov. Estimator: Unadjusted
F-statistic: 1904.7
Entities: 1669 P-value 0.0000
Avg Obs: 1.6291 Distribution: F(9,2710)
Min Obs: 1.0000
Max Obs: 29.000 F-statistic (robust): 1904.7
P-value 0.0000
Time periods: 1 Distribution: F(9,2710)
Avg Obs: 2719.0
Min Obs: 2719.0
Max Obs: 2719.0
Parameter Estimates
=====================================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
-----------------------------------------------------------------------------------------------------
bedrooms 23.945 2.3964 9.9924 0.0000 19.247 28.644
beds 10.726 1.6044 6.6855 0.0000 7.5804 13.872
number_of_reviews -0.0534 0.0154 -3.4604 0.0005 -0.0837 -0.0232
review_scores_rating 10.164 3.3717 3.0145 0.0026 3.5526 16.775
distance_to_the_bean -0.0211 0.0032 -6.5499 0.0000 -0.0274 -0.0148
distance_to_lincoln_park_zoo 0.0052 0.0026 2.0068 0.0449 0.0001 0.0102
distance_to_botanic_garden -0.0004 0.0009 -0.4482 0.6540 -0.0021 0.0013
distance_to_garfield_conservatory 0.0045 0.0007 6.2075 0.0000 0.0031 0.0059
distance_to_science_museum 0.0124 0.0014 9.2035 0.0000 0.0098 0.0151
=====================================================================================================
C:\ProgramData\anaconda3\Lib\site-packages\linearmodels\panel\model.py:1214: MissingValueWarning: Inputs contain missing values. Dropping rows with missing observations. super().__init__(dependent, exog, weights=weights, check_rank=check_rank)
entity_effects that correspond to the host_id. Add drop_absorbed = True when defining the problem.¶Let's now control for host FE to capture, for example, the ability of a host (as host with multiple listings may be more skilled). Note that some covariates will be dropped because the fixed effect will be collinear with any variable that doesn't change across listings (in particular, distance to touristic attractions and any variable that doesn't change within observations of the same listing). Discuss your findings.
panel
| bedrooms | beds | number_of_reviews | review_scores_rating | price2 | host_is_superhost2 | total_listings | constant | ||
|---|---|---|---|---|---|---|---|---|---|
| host_id | time | ||||||||
| 17928 | 1 | 1.0 | 1.0 | 505 | 4.70 | 92 | True | 1 | 1 |
| 33004 | 1 | 2.0 | 2.0 | 72 | 4.65 | 170 | True | 4 | 1 |
| 101521 | 1 | 2.0 | 3.0 | 51 | 4.30 | 95 | True | 3 | 1 |
| 504470 | 1 | 1.0 | 1.0 | 17 | 5.00 | 105 | True | 1 | 1 |
| 683529 | 1 | 3.0 | 3.0 | 72 | 4.78 | 204 | False | 20 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 518917937 | 1 | 1.0 | 2.0 | 1 | 1.00 | 136 | False | 26 | 1 |
| 243035563 | 1 | 2.0 | 3.0 | 1 | 5.00 | 248 | False | 30 | 1 |
| 66596522 | 1 | 1.0 | 3.0 | 1 | 5.00 | 230 | False | 2 | 1 |
| 235678953 | 1 | 2.0 | 3.0 | 1 | 3.00 | 216 | True | 16 | 1 |
| 67751745 | 1 | 2.0 | 2.0 | 1 | 5.00 | 367 | False | 7 | 1 |
2719 rows × 8 columns
# Estimate the model with fixed effects as we did in the ML module
from linearmodels.panel import PanelOLS
dependent_variable = panel['price2']
independent_variables = panel[['bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating']]
modFE = PanelOLS(dependent_variable, independent_variables, entity_effects=True, time_effects=False, drop_absorbed=True)
resFE = modFE.fit()
print(resFE)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: price2 R-squared: 0.3276
Estimator: PanelOLS R-squared (Between): 0.5795
No. Observations: 2719 R-squared (Within): 0.0000
Date: Mon, Dec 04 2023 R-squared (Overall): 0.5504
Time: 05:36:26 Log-likelihood -1.249e+04
Cov. Estimator: Unadjusted
F-statistic: 127.41
Entities: 1669 P-value 0.0000
Avg Obs: 1.6291 Distribution: F(4,1046)
Min Obs: 1.0000
Max Obs: 29.000 F-statistic (robust): 127.41
P-value 0.0000
Time periods: 1 Distribution: F(4,1046)
Avg Obs: 2719.0
Min Obs: 2719.0
Max Obs: 2719.0
Parameter Estimates
========================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
----------------------------------------------------------------------------------------
bedrooms 23.193 2.8357 8.1789 0.0000 17.629 28.757
beds 12.269 1.9634 6.2489 0.0000 8.4165 16.122
number_of_reviews -0.0133 0.0214 -0.6196 0.5356 -0.0552 0.0287
review_scores_rating -1.7912 3.7711 -0.4750 0.6349 -9.1910 5.6086
========================================================================================
F-test for Poolability: 5.3126
P-value: 0.0000
Distribution: F(1668,1046)
Included effects: Entity
panel_withdistance
| bedrooms | beds | number_of_reviews | review_scores_rating | price2 | host_is_superhost2 | total_listings | distance_to_the_bean | distance_to_lincoln_park_zoo | distance_to_botanic_garden | distance_to_garfield_conservatory | distance_to_science_museum | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| host_id | time | ||||||||||||
| 17928 | 1 | 1.0 | 1.0 | 505 | 4.70 | 92 | True | 1 | 6323.842702 | 5137.509592 | 12266.427808 | 4007.769743 | 10752.015508 |
| 33004 | 1 | 2.0 | 2.0 | 72 | 4.65 | 170 | True | 4 | 1835.918505 | 645.955946 | 16744.020648 | 8500.159874 | 6272.529685 |
| 101521 | 1 | 2.0 | 3.0 | 51 | 4.30 | 95 | True | 3 | 3606.412223 | 2414.704591 | 14973.467194 | 6731.620259 | 8041.584980 |
| 504470 | 1 | 1.0 | 1.0 | 17 | 5.00 | 105 | True | 1 | 4666.399409 | 3468.863401 | 13916.954743 | 5701.480087 | 9103.333700 |
| 683529 | 1 | 3.0 | 3.0 | 72 | 4.78 | 204 | False | 20 | 6440.393583 | 5248.836639 | 12140.954107 | 3905.207423 | 10873.631977 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 415825853 | 1 | 2.0 | 2.0 | 0 | NaN | 265 | True | 1 | 381.965796 | 858.650326 | 18215.047133 | 9959.287607 | 4799.461418 |
| 177188450 | 1 | 1.0 | 2.0 | 0 | NaN | 248 | False | 16 | 1652.200318 | 484.782145 | 16928.675828 | 8678.524792 | 6086.181005 |
| 441260321 | 1 | 1.0 | 1.0 | 0 | NaN | 150 | True | 1 | 1195.067467 | 222.496468 | 17390.673562 | 9135.896752 | 5623.819314 |
| 97961430 | 1 | 2.0 | 2.0 | 0 | NaN | 176 | False | 1 | 2311.432585 | 1135.778420 | 16271.812047 | 8018.773304 | 6742.687130 |
| 518917937 | 1 | 3.0 | 4.0 | 0 | NaN | 366 | False | 26 | 3780.016837 | 2584.357797 | 14800.601024 | 6568.821704 | 8216.681882 |
3405 rows × 12 columns
dependent_variable = panel_withdistance['price2']
independent_variables = panel_withdistance[['bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating', 'distance_to_the_bean','distance_to_lincoln_park_zoo','distance_to_botanic_garden','distance_to_garfield_conservatory','distance_to_science_museum']]
modFE = PanelOLS(dependent_variable, independent_variables, entity_effects=True, time_effects=False, drop_absorbed=True)
resFE = modFE.fit(cov_type='unadjusted')
print(resFE)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: price2 R-squared: 0.3431
Estimator: PanelOLS R-squared (Between): 0.8418
No. Observations: 2719 R-squared (Within): 0.0000
Date: Mon, Dec 04 2023 R-squared (Overall): 0.8360
Time: 05:36:27 Log-likelihood -1.246e+04
Cov. Estimator: Unadjusted
F-statistic: 60.407
Entities: 1669 P-value 0.0000
Avg Obs: 1.6291 Distribution: F(9,1041)
Min Obs: 1.0000
Max Obs: 29.000 F-statistic (robust): 60.407
P-value 0.0000
Time periods: 1 Distribution: F(9,1041)
Avg Obs: 2719.0
Min Obs: 2719.0
Max Obs: 2719.0
Parameter Estimates
=====================================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
-----------------------------------------------------------------------------------------------------
bedrooms 23.337 2.8139 8.2936 0.0000 17.816 28.859
beds 12.488 1.9494 6.4062 0.0000 8.6631 16.313
number_of_reviews -0.0207 0.0213 -0.9727 0.3309 -0.0624 0.0210
review_scores_rating -0.4863 3.7657 -0.1291 0.8973 -7.8755 6.9029
distance_to_the_bean -0.0171 0.0058 -2.9544 0.0032 -0.0284 -0.0057
distance_to_lincoln_park_zoo 0.0068 0.0034 1.9855 0.0473 7.962e-05 0.0135
distance_to_botanic_garden 0.0031 0.0045 0.6962 0.4864 -0.0056 0.0118
distance_to_garfield_conservatory 0.0024 0.0012 1.9179 0.0554 -5.481e-05 0.0048
distance_to_science_museum 0.0131 0.0072 1.8296 0.0676 -0.0009 0.0271
=====================================================================================================
F-test for Poolability: 4.6278
P-value: 0.0000
Distribution: F(1668,1041)
Included effects: Entity
C:\ProgramData\anaconda3\Lib\site-packages\linearmodels\panel\model.py:1214: MissingValueWarning: Inputs contain missing values. Dropping rows with missing observations. super().__init__(dependent, exog, weights=weights, check_rank=check_rank)
In this section we will prepare the scene for the main part of the analysis: Who competes with whom? How do rival prices impact the price charged for a listing? How does this depend on the number of rivals that a listing faces?
To address these questions, we will first determine which listings are direct competitors of each other. Instead of computing distances, as we did in class, we'll use Hierarchical Clustering as it is much less intensive in terms of memory usage and the size of the files that are generated. By using HC, we'll be able to identify listings that are near each other and belong to the same market/cluster. With this, we'll be able to determine which listings compete against each other.
Once we have the set of listings that compete with each other, we will compute, for each listing, the average price of its rivals' listings.
With all of this, we will examine how prices depend not only on the listing's attribute and the super host status of who manages it (and the number of listings that host manages), but also with the average price of the rival listings and the number of rivals in the cluster.
Let's get started. First step, determine which listings compete against each other. Create a matrix X as follows (adapt this code to your setting):
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
X = np.zeros((df6.shape[0],2))
X[:,0] = df6['latitude']
X[:,1] = df6['longitude']
Use the linkage function with method = centroid. Once you have the output:
dendrogram,After doing so, make sure to do the following:
sns.scatterplot that uses hue to color code the clusters.# Define the problem and use `linkage`
Z = linkage(X, method='centroid')
# Plot the dendrogram
plt.figure(figsize=(25,10))
plt.title('Hierarchal Clustering Dendrogram')
plt.xlabel('Index')
plt.ylabel('Distance')
dendrogram(
Z,
leaf_rotation=90.,
leaf_font_size=8.,
)
plt.show()
# Define the clusters using a distance measure of 0.005
distance_threshold = 0.005
clusters = fcluster(Z, t=distance_threshold, criterion='distance')
# Add the cluster ids to `df5`
df5['cluster_id'] = clusters
num_clusters = max(clusters)
f"The number of clusters is: {num_clusters}."
'The number of clusters is: 468.'
cluster_counts = df5['cluster_id'].value_counts()
plt.figure(figsize=(25,10))
plt.bar(cluster_counts.index, cluster_counts.values, color='red')
plt.xlabel('cluster id')
plt.ylabel('Number of Listings')
plt.title('Distribution of Number of Listings per Cluster')
plt.show()
cluster_counts.describe()
count 468.000000 mean 7.275641 std 9.373460 min 1.000000 25% 1.000000 50% 4.000000 75% 9.000000 max 61.000000 Name: cluster_id, dtype: float64
cluster_counts
107 61
105 51
110 48
128 48
106 46
..
341 1
389 1
380 1
290 1
293 1
Name: cluster_id, Length: 468, dtype: int64
cluster_counts_df = pd.DataFrame({'cluster_id': cluster_counts.index, 'number_of_Listings': cluster_counts.values})
plt.figure(figsize=(25, 10))
sns.scatterplot(data=cluster_counts_df, x='cluster_id', y='number_of_Listings', hue='cluster_id')
plt.title('Scatterplot of Number of Listings per Cluster')
plt.xlabel('cluster ID')
plt.ylabel('Number of Listings')
plt.show()
Now that we have the cluster ID for each observation, we can compute the average price of the cluster together with the number of observations that are in a cluster. Create df7 by merging df6 with the dataframe that has the mean price per cluster and the number of listings per cluster.
df6['cluster_id']=df5['cluster_id']
cluster_stats = df6.groupby('cluster_id')['price2'].agg(['mean', 'count']).reset_index()
cluster_stats.columns = ['cluster_id', 'mean_price', 'number_of_listings']
df7 = pd.merge(df6, cluster_stats, on='cluster_id', how='left')
df7
| id | name | host_id | host_since | host_location | host_acceptance_rate | host_is_superhost | host_neighbourhood | neighbourhood_cleansed | latitude | ... | science_museum | point | distance_to_the_bean | distance_to_lincoln_park_zoo | distance_to_botanic_garden | distance_to_garfield_conservatory | distance_to_science_museum | cluster_id | mean_price | number_of_listings | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7126 | Rental unit in Chicago · ★4.70 · 1 bedroom · 1... | 17928 | 2009-05-19 | Chicago, IL | 95% | t | Ukrainian Village | West Town | 41.901660 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.90166 -87.68021) | 6323.842702 | 5137.509592 | 12266.427808 | 4007.769743 | 10752.015508 | 194 | 142.05 | 20 |
| 1 | 10945 | Rental unit in Chicago · ★4.65 · 2 bedrooms · ... | 33004 | 2009-08-21 | Chicago, IL | 95% | t | Old Town | Lincoln Park | 41.911960 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.91196 -87.63981) | 1835.918505 | 645.955946 | 16744.020648 | 8500.159874 | 6272.529685 | 125 | 184.384615 | 26 |
| 2 | 25879 | Rental unit in Chicago · ★4.30 · 2 bedrooms · ... | 101521 | 2010-03-31 | Chicago, IL | 90% | t | Lincoln Park | Lincoln Park | 41.924990 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.92499 -87.65573) | 3606.412223 | 2414.704591 | 14973.467194 | 6731.620259 | 8041.584980 | 22 | 227.727273 | 11 |
| 3 | 94450 | Rental unit in Chicago · ★5.0 · 1 bedroom · 1 ... | 504470 | 2011-04-12 | Chicago, IL | 92% | t | Andersonville | Edgewater | 41.979600 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.97960 -87.66512) | 4666.399409 | 3468.863401 | 13916.954743 | 5701.480087 | 9103.333700 | 52 | 124.6 | 5 |
| 4 | 145659 | Rental unit in Chicago · ★4.78 · 3 bedrooms · ... | 683529 | 2011-06-09 | Chicago, IL | 99% | f | Roscoe Village | North Center | 41.943420 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.94342 -87.68121) | 6440.393583 | 5248.836639 | 12140.954107 | 3905.207423 | 10873.631977 | 7 | 181.833333 | 30 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3400 | 975407160632841105 | Rental unit in Chicago · ★New · 2 bedrooms · 2... | 415825853 | 2021-07-29 | Calumet Park, IL | 96% | t | Near South Side | Near South Side | 41.863088 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.86309 -87.62668) | 381.965796 | 858.650326 | 18215.047133 | 9959.287607 | 4799.461418 | 133 | 210.965517 | 29 |
| 3401 | 976106807063239067 | Serviced apartment in Chicago · ★New · 1 bedro... | 177188450 | 2018-03-07 | Chicago, IL | 90% | f | River North | Near North Side | 41.891832 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.89183 -87.63820) | 1652.200318 | 484.782145 | 16928.675828 | 8678.524792 | 6086.181005 | 115 | 244.365854 | 41 |
| 3402 | 976881098303922237 | Rental unit in Chicago · ★New · 1 bedroom · 1 ... | 441260321 | 2022-01-19 | Chicago, IL | 100% | t | Downtown Los Angeles | Loop | 41.873189 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.87319 -87.63408) | 1195.067467 | 222.496468 | 17390.673562 | 9135.896752 | 5623.819314 | 131 | 237.607143 | 28 |
| 3403 | 976914878462533348 | Rental unit in Chicago · ★New · 2 bedrooms · 2... | 97961430 | 2016-10-04 | Chicago, IL | NaN | f | West Loop | Near West Side | 41.887230 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.88723 -87.64413) | 2311.432585 | 1135.778420 | 16271.812047 | 8018.773304 | 6742.687130 | 95 | 250.666667 | 36 |
| 3404 | 978264935480956783 | Rental unit in Chicago · ★New · 3 bedrooms · 4... | 518917937 | 2023-06-08 | Chicago, IL | 99% | f | West Town | Lake View | 41.948470 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.94847 -87.65723) | 3780.016837 | 2584.357797 | 14800.601024 | 6568.821704 | 8216.681882 | 34 | 205.962963 | 27 |
3405 rows × 46 columns
Before we re-run the regressions, note that some clusters only have one observations. For these clusters, we cannot include the average price of the rivals in the regressions. We need to exclude these.
# Define `df8` as the subset of df7 that has more than 1 listing.
df8 = df7[df7['number_of_listings'] > 1]
# Define a new `panel2` dataframe, add the time fixed effect, define the indexes, and re-estimate the linear regressions.
panel2 = df8[['host_id', 'bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating', 'price2', 'host_is_superhost2', 'number_of_listings','time']]
panel2 = panel2.set_index(['host_id', 'time'])
panel2['constant'] = 1
panel2 = panel2.dropna()
dependent = panel2.price2
exog = sm.add_constant(panel2[['bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating', 'price2', 'host_is_superhost2', 'number_of_listings']])
dependent_variable = panel2['price2']
independent_variables = panel2[['bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating','number_of_listings']]
modFE = PanelOLS(dependent_variable, independent_variables, time_effects=False)
resFE = modFE.fit(cov_type='unadjusted')
print(resFE)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: price2 R-squared: 0.8602
Estimator: PanelOLS R-squared (Between): 0.8734
No. Observations: 2613 R-squared (Within): 0.0000
Date: Mon, Dec 04 2023 R-squared (Overall): 0.8602
Time: 05:37:05 Log-likelihood -1.482e+04
Cov. Estimator: Unadjusted
F-statistic: 3209.4
Entities: 1577 P-value 0.0000
Avg Obs: 1.6569 Distribution: F(5,2608)
Min Obs: 1.0000
Max Obs: 29.000 F-statistic (robust): 3209.4
P-value 0.0000
Time periods: 1 Distribution: F(5,2608)
Avg Obs: 2613.0
Min Obs: 2613.0
Max Obs: 2613.0
Parameter Estimates
========================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
----------------------------------------------------------------------------------------
bedrooms 23.225 2.4844 9.3480 0.0000 18.353 28.096
beds 11.954 1.6695 7.1607 0.0000 8.6808 15.228
number_of_reviews -0.0727 0.0158 -4.6078 0.0000 -0.1037 -0.0418
review_scores_rating 13.549 0.8701 15.572 0.0000 11.843 15.255
number_of_listings 1.7910 0.1014 17.659 0.0000 1.5921 1.9898
========================================================================================
panel2
| bedrooms | beds | number_of_reviews | review_scores_rating | price2 | host_is_superhost2 | number_of_listings | constant | ||
|---|---|---|---|---|---|---|---|---|---|
| host_id | time | ||||||||
| 17928 | 1 | 1.0 | 1.0 | 505 | 4.70 | 92 | True | 20 | 1 |
| 33004 | 1 | 2.0 | 2.0 | 72 | 4.65 | 170 | True | 26 | 1 |
| 101521 | 1 | 2.0 | 3.0 | 51 | 4.30 | 95 | True | 11 | 1 |
| 504470 | 1 | 1.0 | 1.0 | 17 | 5.00 | 105 | True | 5 | 1 |
| 683529 | 1 | 3.0 | 3.0 | 72 | 4.78 | 204 | False | 30 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 518917937 | 1 | 1.0 | 2.0 | 1 | 1.00 | 136 | False | 37 | 1 |
| 243035563 | 1 | 2.0 | 3.0 | 1 | 5.00 | 248 | False | 48 | 1 |
| 66596522 | 1 | 1.0 | 3.0 | 1 | 5.00 | 230 | False | 36 | 1 |
| 235678953 | 1 | 2.0 | 3.0 | 1 | 3.00 | 216 | True | 36 | 1 |
| 67751745 | 1 | 2.0 | 2.0 | 1 | 5.00 | 367 | False | 9 | 1 |
2613 rows × 8 columns
dependent_variable = panel2['price2']
independent_variables = panel2[['bedrooms', 'beds', 'number_of_reviews', 'review_scores_rating', 'number_of_listings']]
modFE = PanelOLS(dependent_variable, independent_variables, time_effects=False)
resFE = modFE.fit(cov_type='unadjusted')
print(resFE)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: price2 R-squared: 0.8602
Estimator: PanelOLS R-squared (Between): 0.8734
No. Observations: 2613 R-squared (Within): 0.0000
Date: Mon, Dec 04 2023 R-squared (Overall): 0.8602
Time: 05:37:05 Log-likelihood -1.482e+04
Cov. Estimator: Unadjusted
F-statistic: 3209.4
Entities: 1577 P-value 0.0000
Avg Obs: 1.6569 Distribution: F(5,2608)
Min Obs: 1.0000
Max Obs: 29.000 F-statistic (robust): 3209.4
P-value 0.0000
Time periods: 1 Distribution: F(5,2608)
Avg Obs: 2613.0
Min Obs: 2613.0
Max Obs: 2613.0
Parameter Estimates
========================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
----------------------------------------------------------------------------------------
bedrooms 23.225 2.4844 9.3480 0.0000 18.353 28.096
beds 11.954 1.6695 7.1607 0.0000 8.6808 15.228
number_of_reviews -0.0727 0.0158 -4.6078 0.0000 -0.1037 -0.0418
review_scores_rating 13.549 0.8701 15.572 0.0000 11.843 15.255
number_of_listings 1.7910 0.1014 17.659 0.0000 1.5921 1.9898
========================================================================================
We'll now turn to the last part of the problem set. First, create a new variable that takes the value of one if the price of a listing is above the mean price of that of the market. Note that for clusters with a single listing this is not useful so we'll continue to work with the subset of the data that doesn't have clusters with a single listing. Note, also, that we'll need to clean and scale some variables before we can turn to predicting prices.
First, drop name, host_since, host_location, host_neighbourhood, neighbourhood_cleansed and host_acceptance_rate2 as we won't use them. Also drop host_id, accommodates, bedrooms and rows that have missing prices or missing review_scores_rating. Finally, of all the review* columns, keep review_scores_rating only and drop observations for which the host_is_superhost2 variable is missing..
df8
| id | name | host_id | host_since | host_location | host_acceptance_rate | host_is_superhost | host_neighbourhood | neighbourhood_cleansed | latitude | ... | science_museum | point | distance_to_the_bean | distance_to_lincoln_park_zoo | distance_to_botanic_garden | distance_to_garfield_conservatory | distance_to_science_museum | cluster_id | mean_price | number_of_listings | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7126 | Rental unit in Chicago · ★4.70 · 1 bedroom · 1... | 17928 | 2009-05-19 | Chicago, IL | 95% | t | Ukrainian Village | West Town | 41.901660 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.90166 -87.68021) | 6323.842702 | 5137.509592 | 12266.427808 | 4007.769743 | 10752.015508 | 194 | 142.05 | 20 |
| 1 | 10945 | Rental unit in Chicago · ★4.65 · 2 bedrooms · ... | 33004 | 2009-08-21 | Chicago, IL | 95% | t | Old Town | Lincoln Park | 41.911960 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.91196 -87.63981) | 1835.918505 | 645.955946 | 16744.020648 | 8500.159874 | 6272.529685 | 125 | 184.384615 | 26 |
| 2 | 25879 | Rental unit in Chicago · ★4.30 · 2 bedrooms · ... | 101521 | 2010-03-31 | Chicago, IL | 90% | t | Lincoln Park | Lincoln Park | 41.924990 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.92499 -87.65573) | 3606.412223 | 2414.704591 | 14973.467194 | 6731.620259 | 8041.584980 | 22 | 227.727273 | 11 |
| 3 | 94450 | Rental unit in Chicago · ★5.0 · 1 bedroom · 1 ... | 504470 | 2011-04-12 | Chicago, IL | 92% | t | Andersonville | Edgewater | 41.979600 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.97960 -87.66512) | 4666.399409 | 3468.863401 | 13916.954743 | 5701.480087 | 9103.333700 | 52 | 124.6 | 5 |
| 4 | 145659 | Rental unit in Chicago · ★4.78 · 3 bedrooms · ... | 683529 | 2011-06-09 | Chicago, IL | 99% | f | Roscoe Village | North Center | 41.943420 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.94342 -87.68121) | 6440.393583 | 5248.836639 | 12140.954107 | 3905.207423 | 10873.631977 | 7 | 181.833333 | 30 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3400 | 975407160632841105 | Rental unit in Chicago · ★New · 2 bedrooms · 2... | 415825853 | 2021-07-29 | Calumet Park, IL | 96% | t | Near South Side | Near South Side | 41.863088 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.86309 -87.62668) | 381.965796 | 858.650326 | 18215.047133 | 9959.287607 | 4799.461418 | 133 | 210.965517 | 29 |
| 3401 | 976106807063239067 | Serviced apartment in Chicago · ★New · 1 bedro... | 177188450 | 2018-03-07 | Chicago, IL | 90% | f | River North | Near North Side | 41.891832 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.89183 -87.63820) | 1652.200318 | 484.782145 | 16928.675828 | 8678.524792 | 6086.181005 | 115 | 244.365854 | 41 |
| 3402 | 976881098303922237 | Rental unit in Chicago · ★New · 1 bedroom · 1 ... | 441260321 | 2022-01-19 | Chicago, IL | 100% | t | Downtown Los Angeles | Loop | 41.873189 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.87319 -87.63408) | 1195.067467 | 222.496468 | 17390.673562 | 9135.896752 | 5623.819314 | 131 | 237.607143 | 28 |
| 3403 | 976914878462533348 | Rental unit in Chicago · ★New · 2 bedrooms · 2... | 97961430 | 2016-10-04 | Chicago, IL | NaN | f | West Loop | Near West Side | 41.887230 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.88723 -87.64413) | 2311.432585 | 1135.778420 | 16271.812047 | 8018.773304 | 6742.687130 | 95 | 250.666667 | 36 |
| 3404 | 978264935480956783 | Rental unit in Chicago · ★New · 3 bedrooms · 4... | 518917937 | 2023-06-08 | Chicago, IL | 99% | f | West Town | Lake View | 41.948470 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.94847 -87.65723) | 3780.016837 | 2584.357797 | 14800.601024 | 6568.821704 | 8216.681882 | 34 | 205.962963 | 27 |
3277 rows × 46 columns
columns_to_drop = ['name', 'host_since', 'host_location', 'host_neighbourhood', 'neighbourhood_cleansed', 'host_acceptance_rate2']
df9 = df8.drop(columns=columns_to_drop, inplace=False)
df9.columns
Index(['id', 'host_id', 'host_acceptance_rate', 'host_is_superhost',
'latitude', 'longitude', 'property_type', 'room_type', 'accommodates',
'bedrooms', 'beds', 'price', 'minimum_nights', 'maximum_nights',
'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy',
'review_scores_communication', 'review_scores_location',
'review_scores_value', 'price2', 'host_is_superhost2', 'total_listings',
'average_price', 'superhost_status', 'time', 'the_bean',
'lincoln_park_zoo', 'botanic_garden', 'garfield_conservatory',
'science_museum', 'point', 'distance_to_the_bean',
'distance_to_lincoln_park_zoo', 'distance_to_botanic_garden',
'distance_to_garfield_conservatory', 'distance_to_science_museum',
'cluster_id', 'mean_price', 'number_of_listings'],
dtype='object')
df9 = df9.dropna(subset=['id', 'host_id', 'host_acceptance_rate', 'host_is_superhost',
'latitude', 'longitude', 'property_type', 'room_type', 'accommodates',
'bedrooms', 'beds', 'price', 'minimum_nights', 'maximum_nights',
'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy',
'review_scores_communication', 'review_scores_location',
'review_scores_value', 'price2', 'host_is_superhost2', 'total_listings',
'average_price', 'superhost_status', 'time', 'the_bean',
'lincoln_park_zoo', 'botanic_garden', 'garfield_conservatory',
'science_museum', 'point', 'distance_to_the_bean',
'distance_to_lincoln_park_zoo', 'distance_to_botanic_garden',
'distance_to_garfield_conservatory', 'distance_to_science_museum',
'cluster_id', 'mean_price', 'number_of_listings'])
df9
| id | host_id | host_acceptance_rate | host_is_superhost | latitude | longitude | property_type | room_type | accommodates | bedrooms | ... | science_museum | point | distance_to_the_bean | distance_to_lincoln_park_zoo | distance_to_botanic_garden | distance_to_garfield_conservatory | distance_to_science_museum | cluster_id | mean_price | number_of_listings | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7126 | 17928 | 95% | t | 41.901660 | -87.680210 | Entire rental unit | Entire home/apt | 2 | 1.0 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.90166 -87.68021) | 6323.842702 | 5137.509592 | 12266.427808 | 4007.769743 | 10752.015508 | 194 | 142.05 | 20 |
| 1 | 10945 | 33004 | 95% | t | 41.911960 | -87.639810 | Entire rental unit | Entire home/apt | 4 | 2.0 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.91196 -87.63981) | 1835.918505 | 645.955946 | 16744.020648 | 8500.159874 | 6272.529685 | 125 | 184.384615 | 26 |
| 2 | 25879 | 101521 | 90% | t | 41.924990 | -87.655730 | Entire rental unit | Entire home/apt | 6 | 2.0 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.92499 -87.65573) | 3606.412223 | 2414.704591 | 14973.467194 | 6731.620259 | 8041.584980 | 22 | 227.727273 | 11 |
| 3 | 94450 | 504470 | 92% | t | 41.979600 | -87.665120 | Entire rental unit | Entire home/apt | 1 | 1.0 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.97960 -87.66512) | 4666.399409 | 3468.863401 | 13916.954743 | 5701.480087 | 9103.333700 | 52 | 124.6 | 5 |
| 4 | 145659 | 683529 | 99% | f | 41.943420 | -87.681210 | Entire rental unit | Entire home/apt | 8 | 3.0 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.94342 -87.68121) | 6440.393583 | 5248.836639 | 12140.954107 | 3905.207423 | 10873.631977 | 7 | 181.833333 | 30 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3358 | 965968083633594916 | 518917937 | 99% | f | 41.955120 | -87.651490 | Entire rental unit | Entire home/apt | 3 | 1.0 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.95512 -87.65149) | 3147.306075 | 1949.216084 | 15436.260635 | 7207.795509 | 7584.157672 | 30 | 138.405405 | 37 |
| 3365 | 967385536206513874 | 243035563 | 100% | f | 41.856172 | -87.627915 | Entire rental unit | Entire home/apt | 6 | 2.0 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.85617 -87.62792) | 522.801616 | 741.922360 | 18080.299439 | 9822.420541 | 4934.306813 | 128 | 254.729167 | 48 |
| 3367 | 967598679107146992 | 66596522 | 100% | f | 41.891370 | -87.628410 | Entire rental unit | Entire home/apt | 5 | 1.0 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.89137 -87.62841) | 564.779711 | 638.518824 | 18015.025569 | 9766.813293 | 5001.565470 | 113 | 233.972222 | 36 |
| 3378 | 969768212770420822 | 235678953 | 96% | t | 41.887883 | -87.643916 | Entire rental unit | Entire home/apt | 6 | 2.0 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.88788 -87.64392) | 2287.656478 | 1111.788461 | 16295.362898 | 8042.588536 | 6719.130615 | 95 | 250.666667 | 36 |
| 3391 | 973015093684752148 | 67751745 | 99% | f | 41.896538 | -87.638062 | Entire rental unit | Entire home/apt | 6 | 2.0 | ... | POINT (41.79240081830397 -87.58362020107282) | POINT (41.89654 -87.63806) | 1637.883057 | 464.577391 | 16942.304100 | 8693.678335 | 6072.897973 | 112 | 243.555556 | 9 |
2426 rows × 40 columns
df9 = df9.dropna(subset=['host_id', 'accommodates', 'bedrooms', 'price2', 'review_scores_rating','host_is_superhost2'])
We now have to scale numeric variables to be between 0 and 1, and turn strings into exclusive dummy variables as we did in the ML handout. There will be one difference, however. Instead of creating boolean variables for the property_type, we'll create binary variables. This is, instead of the variable saying True or False, it should have a 1 or a 0. We'll do this because some methods do not handle boolean variables well. To do this, use np.where() as follows.
Assume you have a dataframe called dX that has a column called type that takes three possible values: a, b, and c. Then, you would create three binary indicators called, for example dX['da'], dX['db'], and dX['dc'] by doing dX['da'] = np.where(dX['type'] == 'a', 1, 0) where the order of the arguments says "if the condition dX['type'] == 'a' is true, then put a 1 in the new variable dX['da'] and a 0 otherwise. Adapt this to your case and use this to create df10. After you have done this, scale all the other numeric variables in df10 to be between 0 and 1.
Finally, add a column to df10 that will identify a price above the mean of the cluster.
Now we can turn to splitting the data.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
property_types = df9['property_type'].unique()
for prop_type in property_types:
df9[f'is_{prop_type}'] = np.where(df9['property_type'] == prop_type, 1, 0)
df10 = df9.drop(columns=['property_type'])
numeric_columns = df10.select_dtypes(include=[np.number]).columns
scaler = MinMaxScaler()
df10[numeric_columns] = scaler.fit_transform(df10[numeric_columns])
df10
| id | host_id | host_acceptance_rate | host_is_superhost | latitude | longitude | room_type | accommodates | bedrooms | beds | ... | distance_to_garfield_conservatory | distance_to_science_museum | cluster_id | mean_price | number_of_listings | is_Entire rental unit | is_Entire townhouse | is_Entire guesthouse | is_Entire serviced apartment | is_Entire bungalow | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000e+00 | 0.000030 | 95% | t | 0.672639 | 0.559980 | Entire home/apt | 0.066667 | 0.000000 | 0.000000 | ... | 0.213154 | 0.366124 | 0.413276 | 0.271975 | 0.305085 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 3.924913e-15 | 0.000058 | 95% | t | 0.700683 | 0.695847 | Entire home/apt | 0.200000 | 0.166667 | 0.066667 | ... | 0.455644 | 0.212760 | 0.265525 | 0.451739 | 0.406780 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.927308e-14 | 0.000187 | 90% | t | 0.736161 | 0.642307 | Entire home/apt | 0.333333 | 0.166667 | 0.133333 | ... | 0.360182 | 0.273327 | 0.044968 | 0.635785 | 0.152542 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 8.974578e-14 | 0.000943 | 92% | t | 0.884853 | 0.610728 | Entire home/apt | 0.000000 | 0.000000 | 0.000000 | ... | 0.304577 | 0.309678 | 0.109208 | 0.197877 | 0.050847 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1.423750e-13 | 0.001279 | 99% | f | 0.786342 | 0.556617 | Entire home/apt | 0.466667 | 0.333333 | 0.133333 | ... | 0.207618 | 0.370288 | 0.012848 | 0.440906 | 0.474576 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3358 | 9.927576e-01 | 0.974363 | 99% | f | 0.818199 | 0.656566 | Entire home/apt | 0.133333 | 0.000000 | 0.066667 | ... | 0.385885 | 0.257666 | 0.062099 | 0.256499 | 0.593220 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3365 | 9.942143e-01 | 0.456341 | 100% | f | 0.548786 | 0.735848 | Entire home/apt | 0.333333 | 0.166667 | 0.133333 | ... | 0.527017 | 0.166943 | 0.271949 | 0.750442 | 0.779661 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3367 | 9.944334e-01 | 0.125044 | 100% | f | 0.644621 | 0.734185 | Entire home/apt | 0.266667 | 0.000000 | 0.133333 | ... | 0.524015 | 0.169246 | 0.239829 | 0.662302 | 0.576271 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3378 | 9.966631e-01 | 0.442528 | 96% | t | 0.635125 | 0.682038 | Entire home/apt | 0.333333 | 0.166667 | 0.133333 | ... | 0.430945 | 0.228050 | 0.201285 | 0.733192 | 0.576271 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3391 | 1.000000e+00 | 0.127213 | 99% | f | 0.658694 | 0.701724 | Entire home/apt | 0.333333 | 0.166667 | 0.066667 | ... | 0.466090 | 0.205925 | 0.237687 | 0.702996 | 0.118644 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2426 rows × 44 columns
property_types = df9['property_type'].unique()
for prop_type in property_types:
df9[f'is_{prop_type}'] = np.where(df9['property_type'] == prop_type, 1, 0)
df10 = df9.drop(columns=['property_type'])
train_df, test_df = train_test_split(df10, test_size = 0.2, random_state = 1)
We'll estimate the default Ridge, Lasso, Decision Tree, and Random Forest regressors, and compute the relevant performance measure for each of them using 10-K fold cross-validation. Then we'll use a grid of parameters to find the best specification for each of them. Finally, we'll use a Decision Tree and a Random Forest classifier to predict whether the price of a listing will be above the mean of the cluster.
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.tree import export_graphviz
df.columns
Index(['id', 'name', 'host_id', 'host_since', 'host_location',
'host_acceptance_rate', 'host_is_superhost', 'host_neighbourhood',
'neighbourhood_cleansed', 'latitude', 'longitude', 'property_type',
'room_type', 'accommodates', 'bedrooms', 'beds', 'price',
'minimum_nights', 'maximum_nights', 'number_of_reviews',
'review_scores_rating', 'review_scores_accuracy',
'review_scores_communication', 'review_scores_location',
'review_scores_value', 'price2', 'host_acceptance_rate2',
'host_is_superhost2'],
dtype='object')
train_df.columns
Index(['id', 'host_id', 'host_acceptance_rate', 'host_is_superhost',
'latitude', 'longitude', 'room_type', 'accommodates', 'bedrooms',
'beds', 'price', 'minimum_nights', 'maximum_nights',
'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy',
'review_scores_communication', 'review_scores_location',
'review_scores_value', 'price2', 'host_is_superhost2', 'total_listings',
'average_price', 'superhost_status', 'time', 'the_bean',
'lincoln_park_zoo', 'botanic_garden', 'garfield_conservatory',
'science_museum', 'point', 'distance_to_the_bean',
'distance_to_lincoln_park_zoo', 'distance_to_botanic_garden',
'distance_to_garfield_conservatory', 'distance_to_science_museum',
'cluster_id', 'mean_price', 'number_of_listings',
'is_Entire rental unit', 'is_Entire townhouse', 'is_Entire guesthouse',
'is_Entire serviced apartment', 'is_Entire bungalow'],
dtype='object')
train_df
| id | host_id | host_acceptance_rate | host_is_superhost | latitude | longitude | room_type | accommodates | bedrooms | beds | ... | distance_to_garfield_conservatory | distance_to_science_museum | cluster_id | mean_price | number_of_listings | is_Entire rental unit | is_Entire townhouse | is_Entire guesthouse | is_Entire serviced apartment | is_Entire bungalow | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10945 | 33004 | 95% | t | 41.911960 | -87.639810 | Entire home/apt | 4 | 2.0 | 2.0 | ... | 8500.159874 | 6272.529685 | 125 | 184.384615 | 26 | 1 | 0 | 0 | 0 | 0 |
| 458 | 22886177 | 2658212 | 88% | t | 41.942980 | -87.647620 | Entire home/apt | 4 | 2.0 | 2.0 | ... | 7635.563022 | 7150.458225 | 39 | 115.756757 | 37 | 1 | 0 | 0 | 0 | 0 |
| 1152 | 45952602 | 1969829 | 87% | t | 41.855245 | -87.646794 | Entire home/apt | 6 | 2.0 | 2.0 | ... | 7723.450909 | 7030.640056 | 136 | 163.4 | 5 | 1 | 0 | 0 | 0 | 0 |
| 1794 | 590591161012149829 | 9923391 | 98% | t | 41.921867 | -87.653091 | Entire home/apt | 4 | 1.0 | 2.0 | ... | 7024.582473 | 7747.925388 | 22 | 227.727273 | 11 | 1 | 0 | 0 | 0 | 0 |
| 1243 | 48466374 | 51946578 | 100% | f | 41.918230 | -87.736340 | Entire home/apt | 6 | 3.0 | 3.0 | ... | 2240.184813 | 16991.262821 | 232 | 101.666667 | 3 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1302 | 49230136 | 166918192 | 90% | t | 41.935600 | -87.642410 | Entire home/apt | 2 | 1.0 | 1.0 | ... | 8213.514279 | 6570.674704 | 20 | 103.5 | 26 | 1 | 0 | 0 | 0 | 0 |
| 1215 | 47875647 | 148973907 | 89% | t | 41.952570 | -87.677560 | Entire home/apt | 2 | 1.0 | 1.0 | ... | 4312.968446 | 10471.537515 | 13 | 147.0 | 9 | 1 | 0 | 0 | 0 | 0 |
| 1484 | 51741865 | 233682638 | 100% | t | 41.946620 | -87.659110 | Entire home/apt | 5 | 2.0 | 3.0 | ... | 6359.608544 | 8424.193052 | 34 | 205.962963 | 27 | 1 | 0 | 0 | 0 | 0 |
| 310 | 17399529 | 33127842 | 58% | t | 41.885240 | -87.619110 | Entire home/apt | 6 | 2.0 | 3.0 | ... | 10800.847254 | 3969.864512 | 104 | 200.575758 | 33 | 0 | 0 | 0 | 1 | 0 |
| 1435 | 51171320 | 21451629 | 63% | t | 41.899930 | -87.670470 | Entire home/apt | 6 | 3.0 | 3.0 | ... | 5090.484721 | 9669.933358 | 198 | 178.285714 | 28 | 0 | 1 | 0 | 0 | 0 |
1940 rows × 44 columns
fix_col = {
'is_Entire rental unit': 'd1',
'is_Entire townhouse': 'd2',
'is_Entire guesthouse': 'd3',
'is_Entire serviced apartment': 'd4',
'is_Entire bungalow': 'd5',
}
train_df = train_df.rename(columns=fix_col)
train_df
| id | host_id | host_acceptance_rate | host_is_superhost | latitude | longitude | room_type | accommodates | bedrooms | beds | ... | distance_to_garfield_conservatory | distance_to_science_museum | cluster_id | mean_price | number_of_listings | d1 | d2 | d3 | d4 | d5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10945 | 33004 | 95% | t | 41.911960 | -87.639810 | Entire home/apt | 4 | 2.0 | 2.0 | ... | 8500.159874 | 6272.529685 | 125 | 184.384615 | 26 | 1 | 0 | 0 | 0 | 0 |
| 458 | 22886177 | 2658212 | 88% | t | 41.942980 | -87.647620 | Entire home/apt | 4 | 2.0 | 2.0 | ... | 7635.563022 | 7150.458225 | 39 | 115.756757 | 37 | 1 | 0 | 0 | 0 | 0 |
| 1152 | 45952602 | 1969829 | 87% | t | 41.855245 | -87.646794 | Entire home/apt | 6 | 2.0 | 2.0 | ... | 7723.450909 | 7030.640056 | 136 | 163.4 | 5 | 1 | 0 | 0 | 0 | 0 |
| 1794 | 590591161012149829 | 9923391 | 98% | t | 41.921867 | -87.653091 | Entire home/apt | 4 | 1.0 | 2.0 | ... | 7024.582473 | 7747.925388 | 22 | 227.727273 | 11 | 1 | 0 | 0 | 0 | 0 |
| 1243 | 48466374 | 51946578 | 100% | f | 41.918230 | -87.736340 | Entire home/apt | 6 | 3.0 | 3.0 | ... | 2240.184813 | 16991.262821 | 232 | 101.666667 | 3 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1302 | 49230136 | 166918192 | 90% | t | 41.935600 | -87.642410 | Entire home/apt | 2 | 1.0 | 1.0 | ... | 8213.514279 | 6570.674704 | 20 | 103.5 | 26 | 1 | 0 | 0 | 0 | 0 |
| 1215 | 47875647 | 148973907 | 89% | t | 41.952570 | -87.677560 | Entire home/apt | 2 | 1.0 | 1.0 | ... | 4312.968446 | 10471.537515 | 13 | 147.0 | 9 | 1 | 0 | 0 | 0 | 0 |
| 1484 | 51741865 | 233682638 | 100% | t | 41.946620 | -87.659110 | Entire home/apt | 5 | 2.0 | 3.0 | ... | 6359.608544 | 8424.193052 | 34 | 205.962963 | 27 | 1 | 0 | 0 | 0 | 0 |
| 310 | 17399529 | 33127842 | 58% | t | 41.885240 | -87.619110 | Entire home/apt | 6 | 2.0 | 3.0 | ... | 10800.847254 | 3969.864512 | 104 | 200.575758 | 33 | 0 | 0 | 0 | 1 | 0 |
| 1435 | 51171320 | 21451629 | 63% | t | 41.899930 | -87.670470 | Entire home/apt | 6 | 3.0 | 3.0 | ... | 5090.484721 | 9669.933358 | 198 | 178.285714 | 28 | 0 | 1 | 0 | 0 | 0 |
1940 rows × 44 columns
fix_col = {
'is_Entire rental unit': 'd1',
'is_Entire townhouse': 'd2',
'is_Entire guesthouse': 'd3',
'is_Entire serviced apartment': 'd4',
'is_Entire bungalow': 'd5',
}
# Rename the columns in 'train_df'
test_df = test_df.rename(columns=fix_col)
# Note that this will only work if your variables are named the same as mine. Adapt this to your setting.
# Note: the 'dist_loc*' variables correspond to the variables that contain the distance from listings to the touristic locations 1, 2, etc.
label_train = train_df['price2']
features_train = train_df[['beds', 'minimum_nights', 'maximum_nights', 'number_of_reviews',
'review_scores_rating', 'host_is_superhost2', 'number_of_listings','mean_price','d1', 'd2', 'd3',
'd4', 'd5','distance_to_the_bean',
'distance_to_lincoln_park_zoo', 'distance_to_botanic_garden',
'distance_to_garfield_conservatory', 'distance_to_science_museum']]
label_test = test_df['price2']
features_test = test_df[['beds', 'minimum_nights', 'maximum_nights', 'number_of_reviews',
'review_scores_rating', 'host_is_superhost2', 'number_of_listings','mean_price', 'd1', 'd2', 'd3',
'd4', 'd5','distance_to_the_bean',
'distance_to_lincoln_park_zoo', 'distance_to_botanic_garden',
'distance_to_garfield_conservatory', 'distance_to_science_museum']]
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
dep_var = train_df['price2'].copy() # This is normally called "labels"
indep_var = train_df[['beds', 'minimum_nights', 'maximum_nights', 'number_of_reviews',
'review_scores_rating', 'host_is_superhost2', 'number_of_listings','mean_price','d1', 'd2', 'd3',
'd4', 'd5','distance_to_the_bean',
'distance_to_lincoln_park_zoo', 'distance_to_botanic_garden',
'distance_to_garfield_conservatory', 'distance_to_science_museum']]
dep_var_test = test_df['price2'].copy()
indep_var_test = test_df[['beds', 'minimum_nights', 'maximum_nights', 'number_of_reviews',
'review_scores_rating', 'host_is_superhost2', 'number_of_listings','mean_price', 'd1', 'd2', 'd3',
'd4', 'd5','distance_to_the_bean',
'distance_to_lincoln_park_zoo', 'distance_to_botanic_garden',
'distance_to_garfield_conservatory', 'distance_to_science_museum']]
indep_var = indep_var.dropna()
dep_var = dep_var.dropna()
dep_var = dep_var.loc[indep_var.index]
indep_var_test = indep_var_test.dropna()
dep_var_test = dep_var_test.dropna()
ridge_reg = Ridge(alpha=0.3)
ridge_reg.fit(indep_var, dep_var)
Ridge(alpha=0.3)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
Ridge(alpha=0.3)
lasso_cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
lasso_reg_cv = Lasso(alpha=1)
lasso_scores = cross_val_score(lasso_reg_cv, indep_var, dep_var, scoring='neg_mean_squared_error', cv=lasso_cv, n_jobs=-1)
# Take the absolute value of the scores
lasso_scores = np.abs(lasso_scores)
print('Mean MSE: %.3f (%.3f)' % (np.mean(lasso_scores), np.std(lasso_scores)))
Mean MSE: 3733.334 (440.457)
lasso_reg_cv2 = Lasso(alpha=0.0005)
lasso_cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
lasso_scores2 = cross_val_score(lasso_reg_cv2, indep_var, dep_var, scoring='neg_mean_squared_error', cv=lasso_cv, n_jobs=-1)
lasso_scores2 = np.abs(lasso_scores2)
print('Mean MSE: %.3f (%.3f)' % (np.mean(lasso_scores2), np.std(lasso_scores2)))
Mean MSE: 3711.178 (413.325)
from sklearn.model_selection import GridSearchCV
# Note that we do not include alpha when declaring the model!
lasso_reg_cv3 = Lasso()
# This one is the same as before so we don't really need to re-define it.
lasso_cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# Define grid
grid = dict()
grid['alpha'] = np.arange(0.001, 1, 0.01)
# Define the search
search = GridSearchCV(lasso_reg_cv3, grid, scoring='neg_mean_squared_error', cv=lasso_cv, n_jobs=-1)
# Perform the search
results = search.fit(indep_var, dep_var)
# summarize
print('MSE: %.3f' % np.abs(results.best_score_))
print('Config: %s' % results.best_params_)
MSE: 3710.192
Config: {'alpha': 0.17099999999999999}
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.803e+04, tolerance: 1.275e+03 model = cd_fast.enet_coordinate_descent(
# Note that we do not include alpha when declaring the model!
ridge_reg_cv = Ridge()
# This one is the same as before so we don't really need to re-define it.
ridge_cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# Define grid
grid = dict()
grid['alpha'] = np.arange(0.001, 1, 0.01)
# Define the search
search = GridSearchCV(ridge_reg_cv, grid, scoring='neg_mean_squared_error', cv=ridge_cv, n_jobs=-1)
# Perform the search
resultsR = search.fit(indep_var, dep_var)
# summarize
print('MSE: %.3f' % np.abs(resultsR.best_score_))
print('Config: %s' % resultsR.best_params_)
MSE: 3710.779
Config: {'alpha': 0.9909999999999999}
# Define and train a default Lasso regression model
lasso_reg = Lasso()
lasso_reg.fit(indep_var, dep_var)
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.852e+03, tolerance: 1.275e+03 model = cd_fast.enet_coordinate_descent(
Lasso()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
Lasso()
lasso_prices = lasso_reg.predict(indep_var_test)
# Define and train a default Decision Tree regression model
dtree_reg = DecisionTreeRegressor()
dtree_reg.fit(indep_var, dep_var)
DecisionTreeRegressor()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
DecisionTreeRegressor()
dtree_prices = dtree_reg.predict(indep_var_test)
fig, axs = plt.subplots(figsize=(15,6))
sns.scatterplot(x = dep_var, y = dtree_reg.predict(indep_var), label = r'Decision tree', alpha = 0.3)
plt.show()
# Define and train a default Random Forests regression model
random_forest_reg = RandomForestRegressor()
random_forest_reg.fit(indep_var, dep_var)
RandomForestRegressor()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
RandomForestRegressor()
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# Use cross validation for the four models you estimated above.
# Ridge Regression
ridge_reg = Ridge()
ridge_scores = cross_val_score(ridge_reg, indep_var, dep_var, scoring='neg_mean_squared_error', cv=cv)
ridge_rmse_scores = np.sqrt(-ridge_scores)
print("Ridge Regression RMSE:", ridge_rmse_scores.mean())
# Lasso Regression
lasso_reg = Lasso()
lasso_scores = cross_val_score(lasso_reg, indep_var, dep_var, scoring='neg_mean_squared_error', cv=cv)
lasso_rmse_scores = np.sqrt(-lasso_scores)
print("Lasso Regression RMSE:", lasso_rmse_scores.mean())
# Decision Tree Regression
decision_tree_reg = DecisionTreeRegressor()
tree_scores = cross_val_score(decision_tree_reg, indep_var, dep_var, scoring='neg_mean_squared_error', cv=cv)
tree_rmse_scores = np.sqrt(-tree_scores)
print("Decision Tree Regression RMSE:", tree_rmse_scores.mean())
# Random Forest Regression
random_forest_reg = RandomForestRegressor()
forest_scores = cross_val_score(random_forest_reg, indep_var, dep_var, scoring='neg_mean_squared_error', cv=cv)
forest_rmse_scores = np.sqrt(-forest_scores)
print("Random Forest Regression RMSE:", forest_rmse_scores.mean())
Ridge Regression RMSE: 60.82055418742339
C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.018e+03, tolerance: 1.150e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.702e+03, tolerance: 1.144e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 7.954e+03, tolerance: 1.163e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.030e+03, tolerance: 1.120e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.502e+04, tolerance: 1.133e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 6.547e+03, tolerance: 1.140e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.193e+03, tolerance: 1.168e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 7.310e+03, tolerance: 1.165e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.966e+03, tolerance: 1.147e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.167e+03, tolerance: 1.144e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 6.039e+03, tolerance: 1.166e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.463e+03, tolerance: 1.158e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.063e+04, tolerance: 1.140e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.805e+03, tolerance: 1.140e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.234e+03, tolerance: 1.144e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.054e+04, tolerance: 1.154e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.899e+03, tolerance: 1.114e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.112e+03, tolerance: 1.158e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.546e+03, tolerance: 1.178e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.154e+04, tolerance: 1.145e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 5.572e+03, tolerance: 1.176e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 7.571e+03, tolerance: 1.129e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.334e+03, tolerance: 1.164e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.624e+03, tolerance: 1.123e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.049e+03, tolerance: 1.142e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.519e+03, tolerance: 1.137e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 6.981e+03, tolerance: 1.126e+03 model = cd_fast.enet_coordinate_descent( C:\ProgramData\anaconda3\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:628: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.214e+03, tolerance: 1.158e+03 model = cd_fast.enet_coordinate_descent(
Lasso Regression RMSE: 60.99349634514588 Decision Tree Regression RMSE: 79.69735363221432 Random Forest Regression RMSE: 58.26056643681418
train and test data, as well as df6 and df10.¶train_df.to_csv('train_data.csv', index=False)
test_df.to_csv('test_data.csv', index=False)
df6.to_csv('df6.csv', index=False)
df10.to_csv('df10.csv', index=False)
File -> Save and export notebook as -> HTML) and upload the resulting HTML file to Canvas. Do not upload your notebook (not the ipynb file but the html file)¶